1. Introduction

This is the final report on research in the system architecture of accelerators for the high
performance execution of logic programs. It was conducted by the Electrical Engineering - Systems
Department of the University of Southern California, under award number 25879 as subcontractor
to the University of California, Berkeley. The research was sponsored by the Defense Advanced
Research Projects Agency under contract number N00014-88-K-0579.

The scope of this work included:

•  Design  of  an  abstract  machine  for  the  execution  of  Prolog,  the  Berkeley  Abstract  Machine 
(BAM). 

• Design, simulation, and implementation of a high-performance VLSI Prolog accelerator
chip, the VLSI-BAM.

•  A  simulator  for  the  Aquarius-II  multiprocessor. 

•  Release  of  version  1.0  of  the  Berkeley  Extended  Prolog  (BXP)  compiler. 

• Design, implementation, evaluation, and release of the Advanced Silicon-Compiler in
Prolog (ASP) System.

All  of  the  above  work  was  completed,  as  reported  in  the  following  section  of  this  report. 

It was originally proposed that this work would include the design and performance evaluation of
the Aquarius-II and Aquarius-III multiprocessors, under options A-II and A-III. As these options
were not funded, the research was not performed.
2. Accomplishments


2.1  Aquarius  Prolog  Compiler 

Our work on compilation of Prolog revealed that the language can be implemented an order of
magnitude more efficiently than the best existing systems, with the result that its speed approaches
that of imperative languages such as C for a significant class of programs. The approach used was
to encode each occurrence of a general feature of Prolog as simply as possible. The design of this
system, Aquarius Prolog, is based upon four principles:

•  Reduce  instruction  granularity.  Use  an  execution  model,  the  Berkeley  Abstract  Machine 
(see  below),  that  retains  the  good  features  of  the  Warren  Abstract  Machine  (WAM). 

•  Exploit  determinism.  Compile  deterministic  programs  with  efficient  conditional  branches. 
Most  predicates  written  by  human  programmers  are  deterministic,  yet  previous  systems 
often  compile  them  in  an  inefficient  manner  by  simulating  conditional  branching  with 
backtracking. 

•  Specialize  unification.  Compile  unification  to  the  simplest  possible  code.  Unification  is  a 
general  pattern-matching  operation  that  can  do  many  things  in  the  implementation:  pass 
parameters,  assign  values  to  variables,  allocate  memory,  and  do  conditional  branching. 

•  Dataflow  analysis.  Derive  type  information  by  global  dataflow  analysis  to  support  the 
above  ideas. 

The resulting Aquarius Prolog system (Appendix 1) is about five times faster than the
high-performance commercial Quintus Prolog compiler. Because of limitations of the dataflow analysis
system, Aquarius is not yet competitive with the C language for all programs. This can be
addressed in future work.

2.2  Berkeley  Abstract  Machine  (BAM) 

The design of the Berkeley Abstract Machine (BAM) was based upon the Programmed Logic
Machine (PLM), which was a straightforward microcoded implementation of the Warren Abstract
Machine, the most widely-used model for the execution of Prolog. Studies of the PLM found that
performance was limited by bus bandwidth. It also proved difficult to perform compiler
optimizations on PLM code because of the complexity of the operations. These problems were addressed
in the BAM design.

The BAM began with a general-purpose RISC architecture and added a minimal set of extensions
to support high-performance Prolog execution. Exploiting these features required simultaneous
development of the architecture and an optimizing compiler. While most Prolog-specific
operations can be done in software, a crucial set of features must be supported by the hardware in
order to achieve the highest performance:

•  Tagging  of  data,  with  tags  kept  in  the  upper  four  bits  of  a  32-bit  word. 

•  Segmented  virtual  addressing. 

•  Separate  instruction  and  data  buses,  with  the  data  bus  being  double-width. 

•  Special  instructions  which  can  also  be  used  in  implementing  other  languages. 

•  Instructions  to  test  and  manipulate  tags. 
• Unification support.
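The tagging scheme in the list above can be illustrated with a short sketch. It assumes only what the text states (a 4-bit tag in the upper bits of a 32-bit word); the tag names and numeric codes are hypothetical and are not the actual VLSI-BAM encoding.

```python
# Sketch of a tagged 32-bit word: 4-bit tag in the upper bits, 28-bit value below.
# The tag names and numeric codes are hypothetical, chosen only for illustration.
TAG_BITS = 4
VALUE_BITS = 32 - TAG_BITS
VALUE_MASK = (1 << VALUE_BITS) - 1

TAG_VAR, TAG_INT, TAG_ATOM, TAG_STRUCT = 0, 1, 2, 3  # hypothetical codes


def make_word(tag: int, value: int) -> int:
    """Pack a tag and a 28-bit value into one 32-bit word."""
    if not 0 <= tag < (1 << TAG_BITS):
        raise ValueError("tag does not fit in 4 bits")
    if not 0 <= value <= VALUE_MASK:
        raise ValueError("value does not fit in 28 bits")
    return (tag << VALUE_BITS) | value


def tag_of(word: int) -> int:
    """Extract the tag with one shift (a 'test tag' instruction in hardware)."""
    return word >> VALUE_BITS


def value_of(word: int) -> int:
    """Extract the value field with one mask."""
    return word & VALUE_MASK
```

In software each tag test costs a shift or mask plus a compare; instructions that test and manipulate tags directly, as listed above, remove that overhead.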

The  results  of  this  study  showed  that  the  special  architectural  features  added  10.6%  to  the  active 
area  of  the  BAM  chip,  while  increasing  performance  by  70%.  This  study  is  presented  in  detail  in 
Appendix  2,  “Fast  Prolog  With  an  Extended  General  Purpose  Architecture.” 

2.3  Advanced  Silicon-Compiler  in  Prolog  (ASP) 

The  Advanced  Silicon-Compiler  in  Prolog  (ASP)  is  a  full-range  hardware  synthesis  system.  The 
goal  of  ASP  is  to  synthesize  a  single-chip  VLSI  processor  from  a  high-level  specification  of  the 
ISA.  The  approach  is  to  study  a  specialized  vertical  slice  of  the  design  space.  The  design  of  the 
system  proceeds  hierarchically.  At  each  level,  many  choices  are  considered  for  each  component, 
making  it  convenient  to  consider  the  process  as  a  conversion  of  a  conceptual  AND-OR  tree  into 
an  AND  tree,  with  design  decisions  being  the  choice  of  a  particular  OR  branch. 
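The conversion of an AND-OR tree into an AND tree can be sketched as follows. This is a minimal illustration, not ASP's design engine: the component names and costs are hypothetical, and picking the cheapest alternative stands in for whatever measurement actually drives a design decision.

```python
# Node shapes (hypothetical): ("leaf", name, cost), ("and", [children]), ("or", [children]).
# An AND node needs all of its children; an OR node offers alternative designs.

def resolve(node):
    """Return (and_tree, cost), choosing the cheapest alternative at each OR node."""
    kind = node[0]
    if kind == "leaf":
        return node, node[2]
    if kind == "and":
        resolved = [resolve(child) for child in node[1]]
        trees = [tree for tree, _ in resolved]
        return ("and", trees), sum(cost for _, cost in resolved)
    if kind == "or":
        # A design decision: keep exactly one OR branch.
        return min((resolve(child) for child in node[1]), key=lambda pair: pair[1])
    raise ValueError(f"unknown node kind: {kind}")


design_space = ("and", [
    ("or", [("leaf", "ripple_adder", 3), ("leaf", "lookahead_adder", 5)]),
    ("or", [("leaf", "register_file_a", 4), ("leaf", "register_file_b", 2)]),
])
and_tree, total_cost = resolve(design_space)
```

After `resolve`, no OR nodes remain: every remaining node is a component that must be built.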

Conceptually, each level of abstraction is composed of a simulator module, a compiler module, a
design program (engine) module, and a knowledge base. Each level accepts a specification in a
formal specialized language and produces a more detailed and concrete specification in a different
specialized language. To determine which design choices should be made, a benchmark program
is provided to each level so that the developing architecture can be simulated and measured
relative to the design choice.

ASP is a design automation (DA) system, as opposed to a computer-aided design (CAD) system. In it, the
silicon compilation problem is divided into three major problem domains: behavioral, logic, and
circuit. The circuit (geometric) domain is concerned with the lowest level of design, the efficient layout on
silicon of a particular logic design. The logic domain produces that logic design, given a
behavioral (or register transfer level - RTL) design. At the highest level, the behavioral domain
generates a behavioral description of a particular ISA.

A summary of ASP is presented in Appendix 3, “A CAD Design Environment Based Upon Prolog.”


2.4  Aquarius-II  Simulator 

As a first step toward a Prolog multiprocessor, we developed the NuSim simulator to serve as a
testbed for new ideas. Based upon the VLSI-PLM, NuSim provides a framework that permits
simulation at many levels, from the instruction set to the memory architecture (including caches
and coherency protocols). The simulator’s flexibility allows extensive instrumentation and
continual updates and changes.

NuSim is an event-driven simulator, with the events being memory accesses ordered by time.
This technique simulates a multiprocessor using a uniprocessor. The simulator consists of 16,000
lines of C code and two small machine-dependent routines to save and restore the coroutine
stacks. It is fairly portable, currently running under 4.3 BSD Unix on the VAX 785 and the Sun 3,
and under System V Unix on an Intel 386-based personal computer.
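The event-driven technique described above can be sketched in a few lines. This is an illustration only, not NuSim's code: Python generators stand in for the coroutines, and the processor names, delays, and addresses are hypothetical.

```python
import heapq


def processor(accesses):
    """A coroutine standing in for one simulated processor.

    It yields (delay, address) pairs; each yield returns control to the scheduler.
    """
    for delay, address in accesses:
        yield delay, address


def simulate(processors):
    """Run all processors on one host thread, ordering memory accesses by time."""
    queue = []  # entries: (time, sequence number, name, address, coroutine)
    seq = 0     # unique tie-breaker so the heap never compares coroutines
    for name, coro in processors.items():
        first = next(coro, None)
        if first is not None:
            delay, address = first
            heapq.heappush(queue, (delay, seq, name, address, coro))
            seq += 1
    trace = []
    while queue:
        time, _, name, address, coro = heapq.heappop(queue)
        trace.append((time, name, address))  # the memory-access event
        following = next(coro, None)          # resume the processor
        if following is not None:
            delay, address = following
            heapq.heappush(queue, (time + delay, seq, name, address, coro))
            seq += 1
    return trace
```

A real simulator would model the cache and coherency protocol at the point where the event is popped; here the event is only recorded.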

In Appendix 4, “The Validation of a Multiprocessor Simulator,” we report on validating NuSim
with respect to the VPSim uniprocessor simulator.
3. Summary

Under this subcontract, the University of Southern California has performed research in
accelerators for the high-performance execution of Prolog programs, including compilation techniques,
accelerator architecture, multiprocessor design, and application to design automation.

In particular, this project included the design and implementation of a microprocessor for the
high-performance execution of Prolog, implementation of a simulator for the Aquarius-II
multiprocessor, release of the Aquarius Prolog Compiler, and design, evaluation, and release of the
ASP System.
Can Logic Programming Execute as Fast as Imperative Programming?


Peter Lodewijk Van Roy
ABSTRACT 

The purpose of this dissertation is to provide constructive proof that the logic programming language
Prolog can be implemented an order of magnitude more efficiently than the best previous systems, so that
its speed approaches imperative languages such as C for a significant class of problems. The driving force
in the design is to encode each occurrence of a general feature of Prolog as simply as possible. The
resulting system, Aquarius Prolog, is about five times faster than Quintus Prolog, a high performance
commercial system, on a set of representative programs. The design is based on the following ideas:

(1) Reduce instruction granularity. Use an execution model, the Berkeley Abstract Machine (BAM),
that retains the good features of the Warren Abstract Machine (WAM), a standard execution model
for Prolog, but is more easily optimized and closer to a real machine.

(2) Exploit determinism. Compile deterministic programs with efficient conditional branches. Most
predicates written by human programmers are deterministic, yet previous systems often compile
them in an inefficient manner by simulating conditional branching with backtracking.

(3) Specialize unification. Compile unification to the simplest possible code. Unification is a general
pattern-matching operation that can do many things in the implementation: pass parameters, assign
values to variables, allocate memory, and do conditional branching.

(4)  Dataflow  analysis.  Derive  type  information  by  global  dataflow  analysis  to  support  these  ideas. 

Because  of  limitations  of  the  dataflow  analysis,  the  system  is  not  yet  competitive  with  the  C  language  for 
all  programs.  I  outline  the  work  that  is  needed  to  close  the  remaining  gap. 
Chapter 1
Introduction 


“You’re given the form,
but you have to write the sonnet yourself.
What you say is completely up to you.”

- Madeleine L’Engle, A Wrinkle In Time

1.  Thesis  statement 

The purpose of this dissertation is to provide constructive proof that the logic programming language
Prolog can be implemented an order of magnitude more efficiently than the best previous systems, so that
its speed approaches imperative languages such as C for a significant class of problems.

The motivation for logic programming is to let programmers describe what they want separately
from how to get it. It is based on the insight that any algorithm consists of two parts: a logical specification
(the logic) and a description of how to execute this specification (the control). This is summarized by
Kowalski’s well-known equation Algorithm = Logic + Control [40]. Logic programs are statements
describing properties of the desired result, with the control supplied by the underlying system. The hope is
that much of the control can be automatically provided by the system, and that what remains is cleanly
separated from the logic. The descriptive power of this approach is high and it lends itself well to analysis.
This is a step up from programming in imperative languages (like C or Pascal) because the system takes
care of low-level details of how to execute the statements.

Many logic languages have been proposed. Of these the most popular is Prolog, which was
originally created to solve problems in natural language understanding. It has successful commercial
implementations and an active user community. Programming it is well understood and a consensus has
developed regarding good programming style. The semantics of Prolog strike a balance between efficient
implementation and logical completeness [42,82]. It attempts to make programming in a subset of
first-order logic practical. It is a naive theorem prover but a useful programming language because of its
mathematical foundation, its simplicity, and its efficient implementation of the powerful concepts of
unification (pattern matching) and search (backtracking).
Prolog is being applied in such diverse areas as expert systems, natural language understanding,
theorem proving [57], deductive databases, CAD tool design, and compiler writing [22]. Examples of
successful applications are AUNT, a universal netlist translator [59], Chat-80, a natural language query system
[81], and diverse in-house expert systems and CAD tools. Grammars based on unification have become
popular in natural language analysis [55,56]. Important work in the area of languages with implicit
parallelism is based on variants of Prolog. Our research group has used Prolog successfully in the development
of tools for architecture analysis [12,16,35], in compilation [19,73,76], and in silicon compilation [11].

Prolog was developed in the early 70’s by Colmerauer and his associates [38]. This early system
was an interpreter. David Warren’s work in the late 70’s resulted in the first Prolog compiler [80]. The
syntax and semantics of this compiler have become the de facto standard in the logic programming
community, commonly known as the Edinburgh standard. Warren's later work on Prolog implementation
culminated in the development of the Warren Abstract Machine (WAM) in 1983 [82], an execution model that
has become a standard for Prolog implementation.

However,  these  implementations  are  an  order  of  magnitude  slower  than  imperative  languages.  As  a 
result,  the  practical  application  of  logic  programming  has  reached  a  crossroads.  On  the  one  hand,  it  could 
degenerate  into  an  interesting  academic  subculture,  with  little  use  in  the  real  world.  Or  it  could  flourish  as 
a  practical  tool.  The  choice  between  these  two  directions  depends  crucially  on  improving  the  execution 
efficiency.  Theoretical  and  experimental  work  suggests  that  this  is  feasible— that  it  is  possible  for  an 
implementation  of  Prolog  to  use  the  powerful  features  of  logic  programming  only  where  they  are  needed. 

Therefore  I  propose  the  following  thesis: 

A program written in Prolog can execute as efficiently as its
implementation in an imperative language. This relies on the development
of four principles:

(1)  An  instruction  set  suitable  for  optimization. 

(2) Techniques to exploit the determinism in programs.

(3)  Techniques  to  specialize  unification. 

(4)  A  global  dataflow  analysis. 




2.  The  Aquarius  compiler 

I have tested this thesis by constructing a new optimizing Prolog compiler, the Aquarius compiler.
The design goals of the compiler are (in decreasing order of importance):

(1)  High  performance.  Compiled  code  should  execute  as  fast  as  possible. 

(2) Portability. The compiler’s output instruction set should be easily retargetable to any sequential
architecture.

(3) Good programming style. The compiler should be written in Prolog in a modular and declarative
style. There are few large Prolog programs that have been written in a declarative style. The
compiler will be an addition to that set.

I  justify  the  four  principles  given  in  the  thesis  statement  in  the  light  of  the  compiler  design: 

(1) Reduce instruction granularity. To generate efficient code it is necessary to use an execution
model and instruction set that allows extensive optimization. I have designed the Berkeley Abstract
Machine (BAM) which retains the good features of the Warren Abstract Machine (WAM) [82],
namely the data structures and execution model, but has an instruction set closer to a sequential
machine architecture. This makes it easy to optimize BAM code as well as port it to a sequential
architecture.

(2) Exploit determinism. The majority of predicates written by human programmers are intended to be
executed in a deterministic fashion, that is, to give only one solution. These predicates are in effect
case statements, yet systems too often compile them inefficiently by using backtracking to simulate
conditional branching. It is important to replace backtracking by conditional branching.

(3) Specialize unification. Unification is the foundation of Prolog. It is a general pattern-matching
operation that can match objects of any size. Its logical semantics correspond to many possible
actions in an implementation, including passing parameters, assigning values to variables, allocating
memory, and conditional branching. Often only one of these actions is needed, and it is important to
simplify the general mechanism. For example, one of the most common actions is assigning a value
to a variable, which can often be simplified to a single load or store.
(4) Dataflow analysis. A global dataflow analysis supports techniques to exploit determinism and
specialize unification by deriving information about the program at compile-time. The BAM instruction
set is designed to express the optimizations possible by these techniques.
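The contrast drawn in principle (3), between general unification and a specialized single store, can be sketched as follows. This is a minimal illustration under stated assumptions, not the compiler's generated code: the `Var` class, the use of tuples for compound terms, and the function names are hypothetical, and the sketch omits the trail needed to undo bindings on backtracking.

```python
class Var:
    """A logic variable; ref is None while the variable is unbound."""

    def __init__(self):
        self.ref = None


def deref(term):
    """Follow bindings until reaching an unbound variable or a non-variable."""
    while isinstance(term, Var) and term.ref is not None:
        term = term.ref
    return term


def unify(a, b):
    """General unification over variables, atomic values, and tuples.

    No trail and no occurs check, so this is a sketch of the general case only.
    """
    a, b = deref(a), deref(b)
    if a is b:
        return True
    if isinstance(a, Var):
        a.ref = b
        return True
    if isinstance(b, Var):
        b.ref = a
        return True
    if isinstance(a, tuple) and isinstance(b, tuple):
        return len(a) == len(b) and all(unify(x, y) for x, y in zip(a, b))
    return a == b


def bind_fresh(var, value):
    """Specialized case: analysis has shown var is fresh, so one store suffices."""
    var.ref = value
```

When the compiler can prove at compile time that one side is a fresh variable, every test in `unify` is redundant and the call collapses to the single assignment in `bind_fresh`.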

Simultaneously with the compiler, our research group has developed a new architecture, the VLSI-BAM,
and its implementation. The first of several target machines for the compiler is the VLSI-BAM. The
interaction between the architecture and compiler design has significantly improved both. This dissertation
describes only the Aquarius compiler. A description of the VLSI-BAM and a cost/benefit analysis of its
features is given elsewhere [34,35].

3.  Structure  of  the  dissertation 

The structure of the dissertation mirrors the structure of the compiler. Figure 1.1 gives an overview
of this structure. Chapter 2 summarizes the Prolog language and previous techniques for its high
performance execution. Chapters 3 through 6 describe and justify the design of the compiler in depth. Chapter 3
discusses its two internal languages: kernel Prolog, which is close to the source program, and the BAM,
which is close to machine code. Chapter 4 gives the optimizing transformations of kernel Prolog. Chapter
5 gives the compilation of kernel Prolog into BAM. Chapter 6 gives the optimizing transformations of
BAM code. Chapter 7 does a numerical evaluation of the compiler. It measures its performance on several
machines, does an analysis of the effectiveness of its optimizations, and briefly compares its performance
with the C language. Finally, chapter 8 gives concluding remarks and suggestions for further work.

The appendices give details about various aspects of the compiler. Appendix A is a user manual for
the compiler. Appendices B and C give a formal definition of BAM syntax and semantics. Appendix D is
an English description of BAM semantics. Appendix E describes the extended DCG notation, a tool that is
used throughout the compiler's implementation. Appendix F lists the source code of the C and Prolog
benchmarks. Appendix G lists the source code of the compiler.
4. Contributions


4.1. Demonstration of high performance Prolog execution

A demonstration that the combination of a new abstract machine (the BAM), new compilation
techniques, and a global dataflow analysis gives an average speedup of five times over Quintus Prolog [58], a
high performance commercial system based on the WAM. This speedup is measured with a set of
medium-sized, realistic Prolog programs. For small programs the dataflow analysis does better, resulting in
an average speedup of closer to seven times. For programs that use built-in predicates in a realistic
manner, the average speedup is about four times, since built-in predicates are a fixed cost. The programs
for which dataflow analysis provides sufficient information are competitive in speed with a good C
compiler.

On the VLSI-BAM processor, programs compiled with the Aquarius compiler execute in 1/3 the
cycles of the PLM [28], a special-purpose architecture implementing the WAM in microcode. Static code
size is three times the PLM, which has byte-coded instructions. The WAM was implemented on SPUR, a
RISC-like architecture with extensions for Lisp [8], by macro-expansion. Programs compiled with
Aquarius execute in 1/7 the cycles of this implementation with 1/4 the code size [34].

4.2.  Test  of  the  thesis  statement 

A test of the thesis that Prolog can execute as efficiently as an imperative language. The results of
this test are only partially successful. Performance has been significantly increased over previous Prolog
implementations; however the system is competitive with imperative languages only for problems for
which dataflow analysis is able to provide sufficient information. This is due to the following factors:

• I have imposed restrictions on the dataflow analysis to make it practical. As programs become
larger, these restrictions limit the quality of the results.

• The fragility of Prolog: minor changes in program text often greatly alter the efficiency with which
the program executes. This is due to the under-specification of many Prolog programs, i.e. their
logical meaning rules out computations but the compiler cannot deduce all cases where this happens.
For example, often a program is deterministic (does not do backtracking) even though the compiler
cannot figure it out. This can result in an enormous difference in performance: often the addition of
a single cut operation or type declaration reduces the time and space needed by orders of magnitude.

• The creation and modification of large data objects. The compilation of single assignment semantics
into destructive assignment (instead of copying) in the implementation, also known as the copy
avoidance problem, is a special case of the general problem of efficiently representing time in logic.
A quick solution is to use nonlogical built-in predicates such as setarg/3 [63]. A better solution
based on dataflow analysis has not yet been implemented.

• Prolog’s apparent need for architectural support. A general-purpose architecture favors the
implementation of an imperative language. To do a fair comparison between Prolog and an imperative
language, one must take the architecture into account. For the VLSI-BAM processor, our research
group has analyzed the costs and benefits of one carefully chosen set of architectural extensions.
With a 5% increase in chip area there is a 50% increase in Prolog performance.

4.3. Development of a new abstract machine

The development of a new abstract machine for Prolog implementation, the Berkeley Abstract
Machine (BAM). This abstract machine allows more optimization and gives a better match to
general-purpose architectures. Its execution flow and data structures are similar to the WAM but it contains an
instruction set that is much closer to the architecture of a real machine. It has been designed to allow
extensive low-level optimization as well as compact encoding of operations that are common in Prolog.
The BAM includes simple instructions (register-transfer operations for a tagged architecture), complex
instructions (frequently needed complex operations), and embedded information (allows better translation
to the assembly language of the target machine). BAM code is designed to be easily ported to
general-purpose architectures. It has been ported to several platforms including the VLSI-BAM, the SPARC, the
MIPS, and the MC68020.
4.4. Development of the Aquarius compiler

The development of the Aquarius compiler, a compiler for Prolog into BAM. The compiler is
sufficiently robust that it is used routinely for large programs. The compiler has the following
distinguishing features:

•  It  is  written  in  a  modular  and  declarative  style.  Global  information  is  only  used  to  hold  information 
about  compiler  options  and  type  declarations. 

• It represents types as logical formulas and uses a simple form of deduction to propagate information
and improve the generated code. This extends the usefulness of dataflow analysis, which derives
information about predicates, by propagating this information inside of predicates.

•  It  is  designed  to  exploit  as  much  as  possible  the  type  information  given  in  the  input  and  extended  by 
the  dataflow  analyzer. 

•  It  incorporates  general  techniques  to  generate  efficient  deterministic  code  and  to  encode  each 
occurrence  of  unification  in  the  simplest  possible  form. 

• It supports a class of simplified unbound variables, called uninitialized variables, which are cheaper
to create and bind than standard variables.

The compiler development proceeded in parallel with the development of a new Prolog system, Aquarius
Prolog [31]. For portability reasons the system is written completely in Prolog and BAM code. The Prolog
component is carefully coded to make the most of the optimizations offered by the compiler.

4.5. Development of a global dataflow analyzer

The development of a global dataflow analyzer as an integral part of the compiler. The analyzer has
the following properties:

• It uses abstract interpretation on a lattice. Abstract interpretation is a general technique that proceeds
by mapping the values of variables in the program to a (possibly finite) set of descriptions.
Execution of the program over the descriptions completes in finite time and gives information about the
execution of the original program.
• It derives a small set of types that lets the compiler simplify common Prolog operations such as
variable binding and unification. These types are uninitialized variables, ground terms, nonvariable
terms, and recursively dereferenced terms. On a representative set of Prolog programs, the analyzer
finds nontrivial types for 56% of predicate arguments; on average 23% are uninitialized (of which
one third are passed in registers), 21% are ground, 10% are nonvariables, and 17% are recursively
dereferenced. The sum of these numbers is greater than 56% because arguments can have multiple
types.

• It provides a significant improvement in performance, reduction in static code size, and reduction in
the Prolog-specific operations of trailing and dereferencing. On a representative set of Prolog
programs, analysis reduces execution time by 18% and code size by 43%. Dereferencing is reduced
from 11% to 9% of execution time and trailing is reduced from 2.3% to 1.3% of execution time.

• It is limited in several ways to make it practical. Its type domain is small, so it is not able to derive
many useful types. It has no explicit representation for aliasing, which occurs when two terms have
variables in common. This simplifies implementation of the analysis, but sacrifices potentially useful
information.
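The idea of abstract interpretation over a small type lattice can be sketched as follows. The lattice here is a hypothetical simplification made up for illustration: it uses three of the type names from the list above, omits recursively dereferenced terms, and does not claim to match the analyzer's actual domain or ordering.

```python
# Hypothetical lattice: bottom <= uninit <= any, and bottom <= ground <= nonvar <= any.
# UP[t] is the set of all elements greater than or equal to t.
UP = {
    "bottom": {"bottom", "uninit", "ground", "nonvar", "any"},
    "uninit": {"uninit", "any"},
    "ground": {"ground", "nonvar", "any"},
    "nonvar": {"nonvar", "any"},
    "any": {"any"},
}


def lub(a, b):
    """Least upper bound: the lowest description that covers both a and b."""
    common = UP[a] & UP[b]
    # The least common upper bound is the one with the most elements above it.
    return max(common, key=lambda t: len(UP[t]))


def join_call_patterns(patterns):
    """Combine the argument types seen at every call site of one predicate."""
    result = ["bottom"] * len(patterns[0])
    for pattern in patterns:
        result = [lub(old, new) for old, new in zip(result, pattern)]
    return result
```

Because the lattice is finite, a description can only move upward a bounded number of times, which is why execution over the descriptions terminates. A result of `any` for an argument means the analysis found no useful type for it.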

4.6.  Development  of  a  tool  for  applicative  programming 

The development of a language extension to Prolog to simplify the implementation of large
applicative programs (Appendix E). The extension generalizes Prolog’s Definite Clause Grammar (DCG) notation
to allow programming with multiple named accumulators. A preprocessor has been written and used
extensively in the implementation of the compiler.
Chapter 2

Prolog and Its High Performance Execution

This chapter gives an overview of the features of the Prolog language and an idea of what it means to
program in logic. It summarizes previous work in its compilation and the possibilities of improving its
execution efficiency. It concludes by giving an overview of related work in the area of high performance
Prolog implementation.

1.  The  Prolog  language 

This section gives a brief introduction to the language. It gives an example Prolog program, and
goes on to summarize the data objects and control flow. The syntax of Prolog is defined in Figure 2.2 and
the semantics are defined in Figure 2.3 (section 2.1). Sterling and Shapiro give a more detailed account of
both [62], as do Pereira and Shieber [56].

A Prolog program is a set of clauses (logical sentences) written in a subset of first-order logic called
Horn clause logic, which means that they can be interpreted as if-statements. A predicate is a set of
clauses that defines a relation, i.e. all the clauses have the same name and arity (number of arguments).
Predicates are often referred to by the pair name/arity. For example, the predicate in_tree/2
defines membership in a binary tree:

in_tree(X, tree(X,_,_)).

in_tree(X, tree(V,Left,Right)) :- X<V, in_tree(X, Left).

in_tree(X, tree(V,Left,Right)) :- X>V, in_tree(X, Right).

(Here ":-" means if, the comma "," means and, variables begin with a capital letter, tree(V,L,R)
is a compound object with three fields, and the underscore "_" is an anonymous variable whose value is
ignored.) In English, the definition of in_tree/2 can be interpreted as: "X is in a tree if it is equal to
the node value (first clause), or if it is less than the node value and it is in the left subtree (second clause),
or if it is greater than the node value and it is in the right subtree (third clause)."

The definition of in_tree/2 is directly executable by Prolog. Depending on which arguments
are inputs and which are outputs, Prolog's execution mechanism will execute the definition in different
ways. The definition can be used to verify that X is in a given tree, or to insert or look up X in a tree.

The execution of Prolog proceeds as a simple theorem prover. Given a query and a set of clauses,
Prolog attempts to construct values for the variables in the query that make the query true. Execution
proceeds depth-first, i.e. clauses in the program are tried in the order they are listed and the predicates
inside each clause (called goals) are invoked from left to right. This strict order imposed on the execution
makes Prolog rather weak as a theorem prover, but useful as a programming language, especially since it
can be implemented very efficiently, much more so than a more general theorem prover.

1.1.  Data 

The data objects and their manipulation are modeled after first order logic.

1.1.1.  The  logical  variable

A variable represents any data object. Initially the value of the variable is unknown, but it may
become known by instantiation. A variable may be instantiated only once, i.e. it is single-assignment.
Variables may be bound to other variables. When a variable is instantiated to a value, this value is seen by
all the variables bound to it. Variables may be passed as predicate arguments or as arguments of compound
data objects. The latter case is the basis of a powerful programming technique based on partial data
structures which are filled in by different predicates.

1.1.2.  Dynamic  typing 

Compound data types are first class objects, i.e. new types can be created at run-time and variables
can hold values of any type. Common types are atoms (unique constants, e.g. foo, abcd), integers, lists
(denoted with square brackets, e.g. [Head|Tail], [a,b,c,d]), and structures (e.g.
tree(X,L,R), quad(X,C,B,F)). Structures are similar to C structs or Pascal records: they have a
name (called the functor) and a fixed number of arguments (called the arity). Atoms, integers, and lists are
also used in Lisp.

s(X, Y, a)        X=Z           X=a
                  Y=b     =>    Y=b
s(Z, b, Z)        a=Z           Z=a

Figure 2.1 - An example of unification

1.1.3.  Unification

Unification is a pattern-matching operation that finds the most general common instance of two data
objects. A formal definition of unification is given by Lloyd [42]. Unification is able to match compound
data objects of any size in a single primitive operation. Binding of variables is done by unification. As a
part of matching, the variables in the terms are instantiated to make them equal. For example, unifying
s(X,Y,a) and s(Z,b,Z) (Figure 2.1) matches X with Z, Y with b, and a with Z. The unified term
is s(a,b,a). Y is equal to b, and both X and Z are equal to a.
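
The unification of Figure 2.1 can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not the algorithm of any particular Prolog system: the term representation (a hypothetical `Var` class for variables, tuples `(functor, arg1, ..., argN)` for structures, strings for atoms) and the helper names `walk` and `unify` are invented for this example, and the occurs check is omitted.

```python
class Var:
    """A logical variable, identified by object identity (hypothetical helper)."""
    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name


def walk(term, bindings):
    """Follow variable bindings until a non-variable or an unbound variable is reached."""
    while isinstance(term, Var) and term in bindings:
        term = bindings[term]
    return term


def unify(a, b, bindings):
    """Return bindings extended so that a and b become equal, or None on failure."""
    a, b = walk(a, bindings), walk(b, bindings)
    if a is b:
        return bindings
    if isinstance(a, Var):
        return {**bindings, a: b}
    if isinstance(b, Var):
        return {**bindings, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple):
        # Structures must agree in functor and arity before arguments are unified.
        if a[0] != b[0] or len(a) != len(b):
            return None
        for x, y in zip(a[1:], b[1:]):
            bindings = unify(x, y, bindings)
            if bindings is None:
                return None
        return bindings
    return bindings if a == b else None


# The example of Figure 2.1: unify s(X,Y,a) with s(Z,b,Z).
X, Y, Z = Var("X"), Var("Y"), Var("Z")
result = unify(("s", X, Y, "a"), ("s", Z, "b", Z), {})
values = {v.name: walk(v, result) for v in (X, Y, Z)}
```

Here `values` holds the final value of each variable after following the binding chain. Returning a new dictionary instead of mutating the old one is a design choice of the sketch: a failed unification then leaves the caller's bindings untouched.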

1.2.  Control 

During execution, Prolog attempts to satisfy the clauses in the order they are listed in the program.
When a predicate with more than one clause is invoked, the system remembers this in a choice point. If the
system cannot make a clause true (i.e. execution fails) then it backtracks to the most recent choice point
(i.e. it undoes any work done trying to satisfy that clause) and tries the next clause. Any bindings made
during the attempted execution of the clause are undone. Executing the next clause may give variables
different values. In a given execution path a variable may have only one value, but in different execution
paths a variable may have different values. Prolog is a single-assignment language: if unification attempts
to give a variable a different value then failure causes backtracking to occur. For example, trying to unify
s(a,b) and s(X,X) will fail because the constants a and b are not equal.

There are four features that are used to manage the control flow. These are the "cut" operation
(denoted by "!" in programs), the disjunction, the if-then-else construct, and negation-as-failure.

1.2.1.  The  cut  operation

The cut operation is used to manage backtracking. A cut in the body of a clause effectively says:
"This clause is the correct choice. Do not try any of the following clauses in this predicate when
backtracking." Executing a cut has the same effect in forward execution as executing true, i.e. it has no
effect. But it alters the backtracking behavior. For example:

p(A) :- q(A), !, r(A).

p(A) :- s(A).

During execution of p(A), if q(A) succeeds then the cut is executed, which removes the choice points
created in q(A) as well as the choice point created when p(A) was invoked. As a result, if r(A)
fails then the whole predicate p(A) fails. If the cut were not there, then if r(A) fails execution
backtracks first to q(A), and if that fails, then it backtracks further to the second clause of p(A), and only
when s(A) in the second clause fails does the whole predicate p(A) fail.
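
The effect of the cut on backtracking can be mimicked with Python generators, where each value yielded by a goal stands for one solution found on backtracking. This is a rough sketch, not how Prolog implements cut: the helper `succeed` and the functions `p_with_cut` and `p_without_cut` are hypothetical names, and the goals q, r and s are modeled as generator functions that do not bind variables.

```python
def succeed(times):
    """Build a goal that succeeds `times` times on backtracking (0 means it fails)."""
    def goal(a):
        for _ in range(times):
            yield a
    return goal


def p_with_cut(q, r, s, a):
    """Models  p(A) :- q(A), !, r(A).   p(A) :- s(A)."""
    for _ in q(a):
        yield from r(a)   # solutions of r(A) after committing to this clause
        return            # the cut discards q's other solutions and the second clause
    yield from s(a)       # reached only if q(A) had no solution at all


def p_without_cut(q, r, s, a):
    """Models  p(A) :- q(A), r(A).   p(A) :- s(A)."""
    for _ in q(a):
        yield from r(a)   # r(A) is retried for every solution of q(A)
    yield from s(a)       # the second clause is always tried on backtracking


# q succeeds twice, r fails, s succeeds once.
with_cut = list(p_with_cut(succeed(2), succeed(0), succeed(1), "x"))
without_cut = list(p_without_cut(succeed(2), succeed(0), succeed(1), "x"))
```

With the cut, the failure of r makes the whole predicate fail, so `with_cut` is empty; without it, execution falls through to the second clause and `without_cut` contains the single solution from s.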

1.2.2.  The  disjunction 

A disjunction is a concise way to denote a choice between several alternatives. It is less verbose than
defining a new predicate that has each alternative as a separate clause. For example:

q(A) :- ( A=a ; A=b ; A=c ).

This predicate returns the three solutions a, b, and c on backtracking. It is equivalent to:

q(a).
q(b).
q(c).

1.2.3.  If-then-else

The if-then-else construct is used to denote a selection between two alternatives in a clause when it is
known that if one alternative is chosen then the other will not be needed. For example, the predicate
p(A) above can be written as follows with an if-then-else:

p(A) :- ( q(A) -> r(A) ; s(A) ).

This has the same semantics as the first definition. The arrow -> in an if-then-else acts as a cut that
removes choice points back to the point where the if-then-else starts.

1.2.4.  Negation-as-failure

Negation in Prolog is implemented by negation-as-failure, denoted by \+(Goal). This is not a
true negation in the logical sense so the symbol \+ is chosen instead of not. A negated goal succeeds if
the goal itself fails, and fails if the goal succeeds. For example:

r(A) :- \+ t(A).

The predicate r(A) will succeed only if t(A) fails. This has the same semantics as:

r(A) :- t(A), !, fail.
r(A).

In other words, if t(A) succeeds then the fail causes failure, and the cut ensures that the second
clause is not tried. If t(A) fails then the second clause is tried because the cut is not executed. Note that
negation-as-failure never binds any of the variables in the goal that is negated. This is different from a
purely logical negation, which must return all results that are not equal to the ones that satisfy the goal.
Negation-as-failure is sound (i.e. it gives logically correct results) if the goal being negated has no unbound
variables in it.

1.3.  Syntax

Figure 2.2 gives a Prolog definition of the syntax of a clause. The definition does not present the
names of the primitive goals that are part of the system (e.g. arithmetic or symbol table manipulation).
These primitive goals are called "built-in predicates." They are defined in the Aquarius Prolog user

clause(H) :- head(H).
clause((H:-B)) :- head(H), body(B).

head(H) :- goal_term(H).

body(G) :- control(G, A, B), body(A), body(B).
body(G) :- goal(G).

goal(G) :- \+control(G, _, _), goal_term(G).

control((A;B), A, B).
control((A,B), A, B).
control((A->B), A, B).
control(\+(A), A, true).

term(T) :- var(T).
term(T) :- goal_term(T).

goal_term(T) :- nonvar(T), functor(T, _, A), term_args(1, A, T).

term_args(I, A, _) :- I>A.
term_args(I, A, T) :- I=<A, arg(I, T, X), term(X), I1 is I+1, term_args(I1, A, T).

% Built-in predicates needed in the definition:
functor(T, F, A) :- (Term T has functor F and arity A).
arg(I, T, X) :- (Argument I of compound term T is X).
var(T) :- (Argument T is an unbound variable).
nonvar(T) :- (Argument T is a nonvariable).


Figure 2.2 - The syntax of Prolog


manual [31]. The figure defines the syntax after a clause has already been read and converted to Prolog's
internal form. It assumes that lexical analysis and parsing have already been done. Features of Prolog that
depend on the exact form of the input (i.e. operators and the exact format of atoms and variables) are not
defined here.

To understand this definition it is necessary to understand the four built-in predicates that it uses.
The predicates functor(T, F, A) and arg(I, T, X) are used to examine compound terms.
The predicates var(T) and nonvar(T) are opposites of each other. Their meaning is straightforward:
they check whether a term T is unbound or bound to a nonvariable term. For example, var(_)
succeeds whereas var(foo(_)) does not.


2.  The  principles  of  high  performance  Prolog  execution 


The first implementation of Prolog was developed by Colmerauer and his associates in France as a
by-product of research into natural language understanding. This implementation was an interpreter. The
first Prolog compiler was developed by David Warren in 1977. Somewhat later Warren developed an
execution model for compiled Prolog, the Warren Abstract Machine (WAM) [82]. This was a major
improvement over previous models, and it has become the de facto standard implementation technique. The WAM
defines a high-level instruction set that corresponds closely to Prolog.

This section gives an overview of the operational semantics of Prolog, the principles of the WAM, a
summary of its instruction set, and how to compile Prolog into it. For more detailed information, please
consult Maier & Warren [43] or Aït-Kaci [1]. The execution model of the Aquarius compiler, the BAM
(Chapter 3), uses data structures similar to those of the WAM and has a similar control flow, although its
instruction set is different.

2.1.  Operational  semantics  of  Prolog 

This  section  summarizes  the  operational  semantics  of  Prolog.  It  gives  a  precise  statement  of  how 
Prolog  executes  without  going  into  details  of  a  particular  implementation.  This  is  useful  to  separate  the 
execution  of  Prolog  from  the  many  optimizations  that  are  done  in  the  WAM  and  BAM  execution  models. 
This  section  may  be  skipped  on  first  reading. 

Figure 2.3 defines the semantics of Prolog as a simple resolution-based theorem prover. For clarity,
the definition has been limited in the following ways: It does not assume any particular representation of
terms. It does not show the implementation of cut, disjunctions, if-then-else, negation-as-failure, or built-in
predicates. It assumes that variables are renamed when necessary to avoid conflicts. It assumes that failed
unifications do not bind any variables. It assumes also that the variable bindings formed in successful
unifications are accumulated until the end of the computation, so that the final bindings give the computed
answer.

Terminology: A goal G is a predicate call, which is similar to a procedure call. A resolvent R is a
list of goals [G_1, G_2, ..., G_r]. The query Q is the goal that starts the execution. The program is a list of


function prolog_execute(Q : goal) : boolean;
var
    B : stack of pair (list of goal, integer);    /* the backtrack stack */
    R : list of goal;                             /* the resolvent */
    i : integer;                                  /* index into program clauses */
begin
    R := [ Q ];
    B := empty;
    push (R,1) on B;
    while true do begin
        /* Control step: find next clause. */
        if empty(B) then return false else pop B into (R,i);
        if (R = [ ]) then return true;
        if (i+1 <= n) then push (R,i+1) on B;

        /* Resolution step: try to unify with the clause. */
        /* At this point, R = [G_1, ..., G_r] and A_i = (H_i :- A_i1, ..., A_im) */
        /* Unify the first goal in R with clause A_i. */
        unify G_1 and H_i;
        if successful unification then begin
            /* In R, replace G_1 by the body of A_i */
            /* If A_i does not have a body, then R is shortened by one goal */
            R := [ A_i1, ..., A_im, G_2, ..., G_r ];
            push (R,1) on B     /* proceed to next goal */
        end
    end
end;

Figure 2.3 - Operational definition of Prolog execution

clauses [A_1, A_2, ..., A_n]. The number of clauses in the program is denoted by n. Each clause A_i has a
head H_i and an optional body given as a list of goals [A_i1, A_i2, ..., A_im].

Execution starts by setting the initial resolvent R to contain the query goal Q. In a resolution-based
theorem prover, the resolvent is transformed in successive steps until (1) it becomes empty, in which case
execution succeeds, (2) all the clause choices are exhausted, in which case execution fails, or (3) the
program goes into an infinite loop. In a single transformation step, a goal G is taken from the current
resolvent R and unified with a clause in the program. The next resolvent is obtained by replacing G by the
body of the clause.

This process is nondeterministic, and much work has been done in the area of automatic theorem
proving to reduce the size of its search space [7]. To get efficiency, the approach of Prolog is to restrict the
process in two ways: by always taking the first goal from R and by trying clauses in the order they are
listed in the program (Figure 2.3). If no successful match is found, then the program backtracks: a previous
resolvent is popped off the backtrack stack and execution continues. Therefore the execution flow of
Prolog is identical to that of a procedural language, with the added ability to backtrack to earlier execution
states.

The function prolog_execute(Q) returns a boolean that indicates whether execution was successful
or not (Figure 2.3). If execution was successful, then there is a set of bindings for the variables in Q that
gives the result of the computation. As a definition, prolog_execute(Q) faithfully mirrors the execution of
Prolog. As an implementation, however, it is incredibly inefficient. For each clause that is tried, it pushes
and pops the complete resolvent (which can be very large) on the backtrack stack. The backtrack stack
grows with each successful resolution step. A practical implementation avoids much of this overhead.
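
The definition in Figure 2.3 can be transcribed almost line for line into Python. The sketch below is an illustration under several assumptions that are not part of the figure: terms are tuples `(functor, arg1, ..., argN)` or strings, a clause is a pair `(head, body)` with `body` a tuple of goals, clause indices start at 0 instead of 1, the bindings are saved on the backtrack stack together with the resolvent, and the function returns the bindings instead of a boolean. The names `Var`, `walk`, `unify` and `rename` are hypothetical helpers. Like the figure, it copies the whole resolvent at every step and is therefore very inefficient.

```python
import itertools


class Var:
    """A logical variable; each instance is distinct (hypothetical helper)."""
    _ids = itertools.count()

    def __init__(self, name):
        self.name = name
        self.id = next(Var._ids)

    def __repr__(self):
        return f"{self.name}_{self.id}"


def walk(t, env):
    """Follow bindings until a non-variable or an unbound variable is reached."""
    while isinstance(t, Var) and t in env:
        t = env[t]
    return t


def unify(a, b, env):
    """Return extended bindings, or None; a failed unification binds nothing."""
    a, b = walk(a, env), walk(b, env)
    if a is b:
        return env
    if isinstance(a, Var):
        return {**env, a: b}
    if isinstance(b, Var):
        return {**env, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            env = unify(x, y, env)
            if env is None:
                return None
        return env
    return env if a == b else None


def rename(t, mapping):
    """Copy a clause, replacing each of its variables by a fresh one."""
    if isinstance(t, Var):
        if t not in mapping:
            mapping[t] = Var(t.name)
        return mapping[t]
    if isinstance(t, tuple):
        return tuple(rename(x, mapping) for x in t)
    return t


def prolog_execute(query, program):
    """Return the bindings of the first solution, or None if execution fails."""
    n = len(program)
    backtrack = [([query], 0, {})]            # B: (resolvent, clause index, bindings)
    while backtrack:
        resolvent, i, env = backtrack.pop()   # control step: find next clause
        if not resolvent:
            return env                        # empty resolvent: success
        if i + 1 < n:
            backtrack.append((resolvent, i + 1, env))
        if i >= n:
            continue                          # empty program: nothing to try
        head, body = rename(program[i], {})
        new_env = unify(resolvent[0], head, env)   # resolution step
        if new_env is not None:
            # Replace the first goal by the clause body and proceed to the next goal.
            backtrack.append((list(body) + resolvent[1:], 0, new_env))
    return None                               # all clause choices exhausted


X, Y, Z, W = Var("X"), Var("Y"), Var("Z"), Var("W")
program = [
    (("parent", "tom", "bob"), ()),
    (("parent", "bob", "ann"), ()),
    (("grandparent", X, Z), (("parent", X, Y), ("parent", Y, Z))),
]
answer = prolog_execute(("grandparent", "tom", W), program)
```

After the call, `walk(W, answer)` gives the value computed for W in the query grandparent(tom, W).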

The next section describes the WAM, an execution model that is much more efficient. In the WAM,
the resolvents are stored in a compact form on several stacks. Only the differences between successive
resolvents are stored, so that memory usage is much less. The stack discipline is used to make
backtracking efficient. The WAM also defines a representation for data items that allows an efficient implementation
of unification.

2.2.  Principles  of  the  WAM

The WAM defines a mapping between the terminology of logic and of a sequential machine (Figure
2.4). Predicates correspond to procedures. Procedures are always written as one large case statement.
Clauses correspond to the arms of this case statement. The scope of variable names is a single clause.
(Global variables exist; however their use is inefficient and is discouraged.) Goals in a clause correspond to
calls. Unification corresponds to parameter passing and assignment. Tail recursion corresponds to
iteration. Features that do not map directly are the single-assignment nature and altering backtracking behavior
with the cut operation.

The WAM is based on four ideas: use tagged pointers to represent dynamically typed data, optimize
backtracking (exploit determinism by doing a conditional branch on the first argument), specialize


Prolog                              Imperative language

set of clauses
predicate; set of clauses
  with same name and arity
clause; axiom
goal invocation
unification
backtracking
logical variable
tail recursion

Figure 2.4 - Mapping between Prolog and an imperative language (according to WAM)

unification (instead of compiling a general unification algorithm, compile instructions that unify with a
known term), and map the execution of Prolog to a real machine. The WAM defines a high-level
instruction set to represent these operations.

2.2.1.  Implementation  of  dynamic  typing  with  tags 

Data is represented by objects that fit in a register and consist of two parts: the tag field (which gives
the type) and the value field (Figure 2.5). The value field is used for different purposes in different types: it
gives the value of integers, the address of variables and compound terms (lists and structures), and it
ensures that each atom has a unique value different from all other atoms. Unbound variables are
implemented as self-referential pointers (that is, they point to themselves) or as pointers to other unbound
variables. The semantics of unification allow variables to be unified together, so that they have identical values
from then on. In the implementation, such variables can point to other variables. Therefore retrieving the
value of a variable requires following this pointer chain to its end, an operation called dereferencing.
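
Dereferencing can be sketched in a few lines of Python. The sketch assumes a simplified model that is not the actual WAM or BAM word layout: memory is a Python list, a word is a `(tag, value)` pair, and the tag names `REF`, `ATOM` and `INT` as well as the functions `new_var`, `bind` and `deref` are hypothetical.

```python
REF, ATOM, INT = "ref", "atom", "int"   # hypothetical tag names


def new_var(memory):
    """Allocate an unbound variable: a REF word that points to its own address."""
    addr = len(memory)
    memory.append((REF, addr))
    return addr


def bind(memory, addr, word):
    """Bind the variable at addr by overwriting its cell with another word."""
    memory[addr] = word


def deref(memory, word):
    """Follow a chain of REF words to its end."""
    while word[0] == REF:
        target = memory[word[1]]
        if target == word:          # self-reference: the variable is unbound
            return word
        word = target
    return word


memory = []
x = new_var(memory)
y = new_var(memory)
bind(memory, x, (REF, y))                   # unify X with Y: X now points to Y
still_unbound = deref(memory, (REF, x))     # chain ends at the unbound Y
bind(memory, y, (INT, 42))                  # instantiate Y
value_of_x = deref(memory, (REF, x))        # X sees the value through the chain
```

The second call to `deref` shows why a value given to one variable is seen by every variable bound to it: the binding is stored once, at the end of the pointer chain.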

[Figure: tagged-word layouts for Atom, Integer, Structure, List, and Variable]

Figure 2.5 - Representation of Prolog terms in WAM and BAM


2.2.2.  Exploit  determinism 


It is often possible to reduce the number of clauses of a predicate that must be tried. The WAM has
instructions that hash on the value of the first argument and do a four-way branch on the tag of the first
argument. These instructions avoid the execution of clauses that could not possibly unify with the goal.
The four-way branch distinguishes between the four data types: variables, constants (atoms and integers),
lists (cons cells), and structures. The hashing instructions hash into tables of constants and tables of
structures. For example:


week(monday).
week(tuesday).
week(wednesday).
week(thursday).
week(friday).
week(saturday).
week(sunday).

This is a set of seven clauses with constant arguments. If the argument X of the call week(X) is a
constant, then at most one clause can unify successfully with it. Hashing is used to pick that clause. If X is an
unbound variable then no such optimization is possible and all clauses are tried in order.
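
The clause selection for week/1 can be sketched as a table lookup. This is a hedged illustration, not WAM code: a Python dictionary stands in for the hash table, `None` stands for an unbound variable, and the names `DAYS`, `WEEK_INDEX` and `candidate_clauses` are invented for the example.

```python
DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]

# Built once, at compile time: constant -> index of the only clause that can match.
WEEK_INDEX = {day: i for i, day in enumerate(DAYS)}


def candidate_clauses(arg):
    """Return the indices of the clauses of week/1 that must be tried.

    arg is None for an unbound variable, otherwise a constant (a string).
    """
    if arg is None:
        return list(range(len(DAYS)))   # no optimization: try all clauses in order
    i = WEEK_INDEX.get(arg)
    return [] if i is None else [i]     # at most one clause can unify


for_constant = candidate_clauses("friday")
for_variable = candidate_clauses(None)
for_unknown = candidate_clauses("holiday")
```

When the lookup yields a single clause, no choice point is needed for the call, which is the determinism that the WAM exploits.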

2.2.3.  Specialize  unification

Most uses of unification are special cases of the general unification algorithm and can be compiled in
a simpler way using information known at compile-time. For example, consider the following clause
which is part of a queue-handling package:

% queue(X, Q) is true
% if Q is a queue containing the single element X.

queue(X, q(s(0), [X|C], C)).

A queue is represented here as a compound term. The complexity of this term is typical of real programs.
In the WAM, a unification in the source code is compiled into a sequence of high-level instructions. The
compiled code executes as if the original clause had been defined as follows, with the nested term q/3
completely unraveled:

queue(X, Q) :- Q=q(A,B,C), A=s(0), B=[X|C].

(The notation P=Q means to unify the two terms P and Q.) The compiled code is:


procedure queue/2

    get_structure   q/3, r(1)     % Q=q(      <- Start unification of q/3
    unify_variable  r(2)          %     A,
    unify_variable  r(3)          %     B,
    unify_variable  r(4)          %     C)
    get_structure   s/1, r(2)     % A=s(      <- Start unification of s/1
    unify_constant  0             %     0)
    get_list        r(3)          % B=[       <- Start unification of list
    unify_value     r(0)          %     X
    unify_value     r(4)          %     |C]
    proceed                       %           <- Return to caller

(r(0) and r(1) are registers holding the arguments X and Q, and r(2), r(3), ... are temporary
registers.) Unification of the nested structure is expanded into a sequence of operations that do special
cases of the general algorithm. These operations are encapsulated in the get and unify instructions.
Unification has two modes of operation: it can take apart an existing structure or it can create a new one.
In the WAM, the decision which mode to use is made at run-time in the get instructions by checking the
type of the object being unified. A mode flag is set which affects the actions of the following unify
instructions (up to the next get). A more detailed overview of the WAM instruction set is given in
section 2.3 below.
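
The two modes can be sketched with a toy machine in Python. This is a simplified illustration and not the WAM instruction semantics in full: the heap is a Python list of `(tag, value)` words, the class `Machine` and the tag names are hypothetical, only `get_list` and `unify_variable` are modeled, and trailing of the new binding is left out.

```python
REF, LIST, CONST = "ref", "list", "const"   # hypothetical tag names


class Machine:
    def __init__(self, heap=None):
        self.heap = heap if heap is not None else []
        self.mode = None    # "read" or "write", set by get_list
        self.s = None       # structure pointer, used in read mode

    def deref(self, word):
        """Follow REF chains; an unbound variable points to itself."""
        while word[0] == REF and self.heap[word[1]] != word:
            word = self.heap[word[1]]
        return word

    def get_list(self, word):
        """Unify word with a list cell and select the mode for later unify instructions."""
        word = self.deref(word)
        if word[0] == REF:                      # unbound variable: build a new cell
            self.heap[word[1]] = (LIST, len(self.heap))
            self.mode = "write"
        elif word[0] == LIST:                   # existing cell: take it apart
            self.s = word[1]
            self.mode = "read"
        else:
            return False                        # any other type: unification fails
        return True

    def unify_variable(self):
        """Return the next argument: read it from the cell or create a fresh variable."""
        if self.mode == "read":
            word = self.heap[self.s]
            self.s += 1
        else:
            word = (REF, len(self.heap))
            self.heap.append(word)
        return word


# Read mode: the argument is the existing list [a|nil] stored at address 0.
reader = Machine([(CONST, "a"), (CONST, "nil")])
read_ok = reader.get_list((LIST, 0))
head, tail = reader.unify_variable(), reader.unify_variable()

# Write mode: the argument is an unbound variable at address 0.
writer = Machine([(REF, 0)])
write_ok = writer.get_list((REF, 0))
new_head, new_tail = writer.unify_variable(), writer.unify_variable()
```

The same pair of `unify_variable` calls either reads the two fields of an existing cell or allocates two fresh variables after the newly bound list pointer, depending only on the flag set by `get_list`.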

2.2.4.  Map  execution  to  a  real  machine

The control flow of Prolog is mapped to multiple stacks. The stack representation holds the
resolvents in a form that makes each resolution step as efficient as a procedure call in an imperative language.
The stack-based structure allows fast recovery of memory on backtracking. As a result, some applications
do not need a garbage collector.

A further optimization maps Prolog variables to registers. The variables in a clause are partitioned
into three classes (temporary, permanent, and void) depending on their lifetimes. Void variables have no
lifetime and need no storage. Temporary variables do not need to survive across procedure calls, so they
can be stored in machine registers. Permanent variables are stored in environments (i.e. stack frames) local
to a clause.

2.3.  Description  of  the  WAM

The previous section gave an overview of the ideas in the WAM, with a simple example of generated
code. This section completes that description by presenting the data storage, execution state, and
instruction set of the WAM in full. It also gives a larger example of generated code and a scheme to compile
Prolog into WAM.

2.3.1.  Memory  areas

Memory of the WAM is divided into six logical areas (Figure 2.6): three stacks for the data objects,
one stack to support unification, one stack to support the interaction of unification and backtracking, and
one area as code space.

(1) The global stack. This stack is also known as the heap, although it follows a stack discipline. This
stack holds terms (lists and structures, the compound data of Prolog).


[Figure: three kinds of data objects on stacks. Areas shown: execution state, environment stack,
choice point stack, global stack (heap), trail stack, push-down stack; registers r(e), r(b), r(h)]

Figure 2.6 - Data structures of WAM and BAM

(2) The environment stack. This stack holds environments (i.e. local frames) which contain variables
local to a clause. Because of backtracking (control may return to a clause whose environment is
deep inside the stack), this area does not follow a strict stack discipline; however, convention has
kept this naming. (The other stacks in the WAM do follow a stack discipline.)

(3) The choice point stack. Also known as the backtrack stack, this stack holds choice points, data
objects similar to closures that encapsulate the execution state for backtracking.

(4) The trail. The trail stack is used to save locations of bound variables that have to be unbound on
backtracking. Saving variables is called trailing, and restoring them to unbound is called detrailing.
Not all variables that are bound have to be trailed. A variable must only be trailed if it continues to
exist on backtracking, i.e. if its location on the heap or the environment is older than the most recent
choice point. This is called the trail condition.

(5) The push-down stack. This stack is used as a scratch-pad during the unification of nested
compound terms.

(6) The code space. This area holds the compiled code of a program.
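
The trail condition described in item (4) can be sketched as follows. The sketch is a simplified model under stated assumptions, not the WAM data layout: only heap variables are handled, at most one choice point exists at a time, an unbound cell is represented by `None`, and the class `Store` with its methods is hypothetical. The field `hb` plays the role of the heap top saved when the most recent choice point was created.

```python
class Store:
    """A toy heap with a trail (hypothetical model)."""

    def __init__(self):
        self.heap = []     # heap[i] is None while variable i is unbound
        self.trail = []    # addresses that must be reset on backtracking
        self.hb = 0        # heap top when the most recent choice point was created

    def new_var(self):
        self.heap.append(None)
        return len(self.heap) - 1

    def create_choice_point(self):
        self.hb = len(self.heap)
        return (len(self.heap), len(self.trail))   # saved heap top and trail top

    def bind(self, addr, value):
        self.heap[addr] = value
        if addr < self.hb:             # trail condition: older than the choice point
            self.trail.append(addr)

    def backtrack(self, choice_point):
        heap_top, trail_top = choice_point
        while len(self.trail) > trail_top:         # detrailing
            self.heap[self.trail.pop()] = None
        del self.heap[heap_top:]                   # newer cells simply disappear


store = Store()
old = store.new_var()                  # created before the choice point
cp = store.create_choice_point()
new = store.new_var()                  # created after the choice point
store.bind(old, "a")                   # must be trailed
store.bind(new, "b")                   # need not be trailed
trailed = list(store.trail)
store.backtrack(cp)
```

Only the older variable is recorded on the trail. The newer one is not, because its cell is discarded anyway when the heap is cut back to the saved top.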

It is possible to vary the organization of the memory areas somewhat without changing anything substantial
about the execution. For example, some Prolog systems (including the Aquarius system) combine the
environment and choice point stacks into a single memory area. This area is often called the local stack.
Since the push-down stack is only used during general unification, it can be kept on the top of the heap.

2.3.2.  Execution  state


The internal state of the WAM and the BAM is given in Table 2.1. The differences between WAM
and BAM are indicated in the table: The BAM adds the register r(tmp_cp) for efficient interfacing of
Prolog predicates with assembly language. The WAM adds the register r(s) and the mode flag mode
for use by the unification instructions. The registers p(I) are not machine registers, but locations in the
current environment, pointed to by r(e).


Table 2.1 - Execution state of WAM and BAM

Register            Description

r(e)                Current environment on the environment stack.
r(a)                Top of the environment stack (WAM only).
r(b)                Top-most choice point on the choice point stack.
r(h)                Top of the heap.
r(hb)               Top of heap when top-most choice point was created.
r(tr)               Top of the trail stack.
                    Program counter.
r(cp)               Continuation pointer (return address).
r(tmp_cp)           Continuation pointer to interface with assembly (BAM only).
r(s)                Structure pointer (WAM only).
mode                Unification mode flag (value is read or write, WAM only).
r(0), r(1), ...     Registers for argument passing and temporary storage.
p(0), p(1), ...     Locations in the current environment (permanent variables).

2.3.3.  The  instruction  set


Table 2.2 contains the WAM instruction set, with a brief description of what each instruction does.
The get_(...) and unify_(...) instructions echo the put instructions, so their listing is
abbreviated. v(N) is shorthand notation for r(N) or p(N). "Globalizing" a variable (see the
put_unsafe_value instruction) moves an unbound variable from the environment to the heap to avoid
dangling pointers.


Table 2.2 - The WAM instruction set

Loading argument registers (just before a call)
    put_variable v(N), r(I)           Create a new variable, put in v(N) and r(I).
    put_value v(N), r(I)              Move v(N) to r(I).
    put_unsafe_value v(N), r(I)       Move v(N) to r(I) (and globalize).
    put_constant C, r(I)              Move immediate value C to r(I).
    put_nil r(I)                      Move nil to r(I).
    put_structure F, r(I)             Create functor F, put in r(I).
    put_list r(I)                     Create a list pointer, put in r(I).

Unifying with registers and structure arguments (head unification)
    get_(...), r(I)                   Unify (...) with r(I).
    unify_(...)                       Unify (...) with structure argument.

Procedural control
    call Label, N                     Call a predicate.
    execute Label                     Jump to a predicate.
    proceed                           Return from a predicate.
    allocate                          Create local stack frame.
    deallocate                        Remove local stack frame.

Selecting a clause (conditional branching)
    switch_on_term V, C, L, S         Four-way branch on r(0)'s tag.
    switch_on_constant N, Tbl         Hash table lookup of an atomic term in r(0).
    switch_on_structure N, Tbl        Hash table lookup of a functor in r(0).

Backtracking (choice point management)
    try_me_else Label      try Label        Create a choice point.
    retry_me_else Label    retry Label      Change retry address.
    trust_me_else fail     trust Label      Remove top-most choice point.

2.3.4. An example of WAM code


Figure 2.7 gives the Prolog definition and the WAM instructions for the predicate append/3. The
mapping between Prolog and WAM instructions is straightforward: the switch instruction branches to
the right clause depending on the type of the first argument, the choice point (try) instructions link the
clauses together, the get instructions unify with the head arguments, and the unify instructions unify
with the arguments of structures.

The same instruction sequence is used to take apart an existing structure (read mode) or to build a
new structure (write mode). The decision which mode to use is made in the get instructions, which set a
mode flag. For example, if get_list r(0) sees an unbound variable argument, it sets the flag to
write mode. If it sees a list argument, it sets the flag to read mode. If it sees any other type, it fails, i.e. it
backtracks by restoring state from the most recent choice point.
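
The mode decision made by get_list can be sketched in Python. This is only an illustration under stated assumptions: the Ref cell class, the ("list", head, tail) tuple encoding, and the function names are invented for the example and are not the WAM's actual data layout.

```python
class Ref:
    """A memory cell; an unbound variable is a cell that points to itself."""
    def __init__(self):
        self.value = self  # self-reference marks the cell as unbound


def deref(term):
    """Follow the pointer chain to its end."""
    while isinstance(term, Ref) and term.value is not term:
        term = term.value
    return term


def get_list(arg):
    """Return the mode chosen for `arg`, or None if unification fails.

    Lists are modelled as ("list", head, tail) tuples (an assumption).
    """
    term = deref(arg)
    if isinstance(term, Ref):
        return "write"   # unbound variable: a new list will be built
    if isinstance(term, tuple) and term[0] == "list":
        return "read"    # existing list: it will be taken apart
    return None          # any other type: fail, i.e. backtrack
```

The unify instructions that follow would consult the returned mode to decide whether to load from an existing cell or to create a new one.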

Choice point handling is done by the try instructions. The try_me_else L instruction
creates a choice point, i.e. it saves all the machine registers on a stack in memory. It is compiled before the
first clause in a predicate. It continues execution with the next instruction and backtracks to label L. (The
try L instruction is identical to try_me_else, except that it continues execution at L and backtracks
to the next instruction.) The retry_me_else L instruction modifies a choice point that already exists
by changing the address that it jumps to on backtracking. It is compiled before all clauses after the first but
not including the last. The trust_me_else fail instruction removes the top-most choice point from
the stack. It is compiled before the last clause in a predicate.
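
The behaviour of the three instructions can be illustrated with a toy model in Python. The Machine class, its fields, and the representation of a choice point as a pair of saved registers and a retry label are assumptions made for this sketch; a real WAM saves more state than this.

```python
class Machine:
    """Toy model: registers plus a stack of choice points (assumed layout)."""

    def __init__(self, registers):
        self.registers = list(registers)
        self.choice_points = []  # each entry: [saved registers, retry label]

    def try_me_else(self, label):
        # Save the machine state and remember where to resume on failure.
        self.choice_points.append([list(self.registers), label])

    def retry_me_else(self, label):
        # Reuse the existing choice point; only the retry address changes.
        self.choice_points[-1][1] = label

    def trust_me_else_fail(self):
        # Last clause: no alternatives remain, so drop the choice point.
        self.choice_points.pop()

    def fail(self):
        # Backtrack: restore state from the most recent choice point and
        # return the label at which execution resumes.
        saved, label = self.choice_points[-1]
        self.registers = list(saved)
        return label
```

In this model, a failure in the first clause restores the registers and resumes at the label given to try_me_else; the clause found there begins with retry_me_else or trust_me_else.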

2.3.5. Compiling into WAM

Compiling  Prolog  into  WAM  is  straightforward  because  there  is  almost  a  one-to-one  mapping 
between  items  in  the  Prolog  source  code  and  WAM  instructions.  Figure  2.8  gives  a  scheme  for  compiling 
Prolog  to  WAM.  This  compilation  scheme  generates  suboptimal  code.  One  can  optimize  it  by  generating 
switch  instructions  to  avoid  choice  point  creation  in  some  cases  [73]. 

The clauses of predicate p/3 are compiled into blocks of code that are linked together with try
instructions to manage choice points. Each block consists of a sequence of get instructions to do the
unification of the head arguments, followed by a sequence of put instructions to set up the arguments for
each goal in the body, and a call instruction to execute the goal. The block is surrounded by
allocate and deallocate instructions to create an environment for permanent variables.

The last call optimization, or LCO (also called tail recursion optimization, although it is applicable to
all predicates, not just recursive ones), converts a call instruction followed by a return into a jump, i.e. it


append([], L, L).
append([X|L1], L2, [X|L3]) :- append(L1, L2, L3).

Prolog definition of append/3


append/3:
      switch_on_term V1, C1, C2, fail   ; Go to V1 if r(0) is a variable.
                                        ; Go to C1 if r(0) is a constant.
                                        ; Go to C2 if r(0) is a list.
                                        ; Fail if r(0) is a structure.
V1:   try_me_else V2                    ; Create a choice point.
C1:   get_nil r(0)                      ; Unify r(0) with nil.
      get_value r(1), r(2)              ; Unify r(1) and r(2).
      proceed                           ; Return to caller.
V2:   trust_me_else fail                ; Remove choice point.
C2:   get_list r(0)                     ; Start unification of r(0) with a list.
      unify_variable r(3)               ; Load head of list into r(3).
      unify_variable r(0)               ; Load tail of list into r(0).
      get_list r(2)                     ; Start unification of r(2) with a list.
      unify_value r(3)                  ; Unify head of list with r(3).
      unify_variable r(2)               ; Load tail of list into r(2).
      execute append/3                  ; Jump to append/3 (last call optimization).

WAM code for append/3

Figure 2.7 - Compiling append/3 into WAM code


reduces memory usage on the environment stack. For recursive predicates, the LCO converts recursion
into iteration, since the jump is to the first instruction of the predicate. The WAM implements a
generalization of last call optimization called environment trimming that allows the environment to become
smaller after each call.


3.  Going  beyond  the  WAM 

Prolog implementations have made great progress in execution efficiency with the development of
the WAM [82]. However, these systems are still an order of magnitude slower than implementations of
popular imperative languages such as C. To improve the execution speed it is necessary to go beyond the
WAM. This section discusses the limits of the WAM and how the four principles of the Aquarius compiler
build on the WAM to achieve higher performance.


Original Prolog predicate:

  p(E,F,G) :- k(X,F,P), m(S,T), ...
  p(A,B,C) :- q(A,Z,V), r(W,T,B), ..., z(A,X).
  p(Q,R,S) :- ...

Compiled WAM code (the choice point links the clause entry points L1, L2, ..., Ln):

  L1:   try_me_else L2
        code for clause 1
  L2:   retry_me_else L3
        code for clause 2
        ...
  Ln:   trust_me_else fail
        code of last clause

A single compiled clause:

  allocate                 Create environment.
  (get arguments)          Unify with caller arguments.
  (put arguments)
  call q/3                 Load arguments and call.
  (put arguments)
  call r/3                 Load arguments and call.
  ...
  (put arguments)
  deallocate               Remove environment.
  execute z/2              Last call is a jump.


Figure 2.8 - Compiling Prolog into WAM


3.1.  Reduce  instruction  granularity 

The WAM is an elegant mapping of Prolog to a sequential machine. Its instructions encapsulate
parts of the general unification algorithm. However, these parts are quite large, so that many optimizations
are not possible. For example, consider the predicate:

p(bar).


This is compiled as:

get_constant bar, r(0)
proceed

The get_constant instruction encapsulates a series of operations: dereference r(0) (follow the
pointer chain to its end), test its type, and do either read mode unification (check that the value of r(0) is
bar) or write mode unification (trail r(0) and store bar in its cell). All this generality is often
unnecessary. For example, if the predicate p(X) is always called with a dereferenced atom, then
unification reduces to a simple check that the value is correct. The other operations are superfluous.
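
The difference between the general instruction and the specialized check can be sketched in Python. The Ref class, the list used as a trail, and both function names are assumptions made for the illustration; in particular, this sketch trails every binding, whereas an implementation may trail only when needed.

```python
class Ref:
    """A memory cell; an unbound variable is a cell that points to itself."""
    def __init__(self):
        self.value = self


def deref(term):
    """Follow the pointer chain to its end."""
    while isinstance(term, Ref) and term.value is not term:
        term = term.value
    return term


def get_constant(const, arg, trail):
    """General version: dereference, test the type, then read or write."""
    term = deref(arg)
    if isinstance(term, Ref):   # write mode: bind the unbound variable
        trail.append(term)      # record the binding for backtracking
        term.value = const
        return True
    return term == const        # read mode: compare the values


def check_constant(const, arg):
    """Specialized version, valid only if `arg` is a dereferenced atom."""
    return arg == const
```

When the caller is known to pass a dereferenced atom, check_constant gives the same answer as get_constant while skipping the dereference loop, the type test, and the trail.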

The Aquarius compiler's execution model, the BAM, is designed to retain the good features of the
WAM while allowing optimizations such as this one. It retains data structures and an execution flow
similar to the WAM, but it has an instruction set of finer granularity (Chapter 3). The compiler does not use the
WAM during compilation, but directly compiles to the BAM. It is of fine enough grain to allow extensive
optimization, but it also encodes compactly the operations common in Prolog. For example, it includes an
explicit dereferencing instruction, which makes it possible to reduce the amount of dereferencing
significantly by only doing it when it is necessary and not in every instruction.

3.2.  Exploit  determinism 

The majority of predicates written by human programmers are intended to give only one solution, i.e.
they are deterministic. However, too often they are compiled in an inefficient manner using shallow
backtracking (backtracking within a predicate to choose the correct clause), when they are really just case
statements. This is inefficient since backtracking requires saving the machine state and restoring it repeatedly.

3.2.1.  Measurement  of  determinism 

Measurements of Prolog applications support these assertions:

(1) Tick shows that choice point references constitute about half (45-60%) of all data references [69].

(2) Touati and Despain show that at least 40% of all choice point and fail operations can be removed
through optimization [70].

The latter result is especially interesting because it attempts to quantify how often shallow backtracking is
optimizable. It considers a choice point to be avoidable if between the access of a choice point and its
removal by a cut there are no calls to non-built-in predicates, no returns, and only binding of variables that
do not have to be restored on backtracking. Avoidable choice points do not have to be created because
they are removed immediately. For a set of medium-sized programs, on average the following percentages
of choice point creations are avoidable: 57% of the ones removed by cut, 43% of the ones removed by
trust, and 48% of the ones restored by fail. The variance of these numbers is large, but the potential for
optimization when these situations do occur is significant. The Aquarius compiler is able to take advantage
of these optimizations and more, e.g. due to the factoring transformation (Chapter 4) it is able to compile
the partition/4 predicate in Warren's quicksort benchmark [30] into deterministic code. The
optimizations are synergistic, that is, doing them makes other improvements possible:

(1) Less stack space is needed on the environment/choice point stack. Choice points and environments
are both stored on this stack, which means that often a clause's environment is hidden underneath a
more recently created choice point. When this happens the last call optimization is not able to
recover space. If fewer choice points are created, then last call optimization is effective more often.

(2) There are fewer memory references to the heap because binding a variable is postponed until a
clause is chosen.

(3) There is less trailing because it is only needed for bindings that cross a choice point.

(4) Garbage collection is more efficient, since the creation of fewer choice points means that there are
fewer starting points for marking.

3.2.2. Ramifications of exploiting determinism

The goal of compiling deterministic predicates into efficient conditional branches affects a large part
of the compiler. Many of the transformations done in the compiler are intended to increase the amount of
determinism that is easily accessible. This includes formula manipulation, factoring, head unraveling, the
determinism transformation (all in Chapter 4), the determinism compiler (Chapter 5), and the determinism
optimization (Chapter 6).

Through these transformations the compiler creates a decision graph to index the arguments of a
predicate. Type information derived by dataflow analysis is exploited to simplify the graph. The graph is
created in an architecture-independent way through the concept of the test set (Chapter 4). Intuitively, a
test set is a set of Prolog predicates that are mutually disjoint (only one can succeed at any given time) and
that correspond to a multi-way branch in the architecture.
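
As an illustration of why mutually disjoint tests map onto a multi-way branch, the following Python sketch selects a clause entry point from the tag of the first argument, in the manner of the switch_on_term instruction in Figure 2.7. The tag names, the Python encodings of terms, and the function names are assumptions for the example; they are not the compiler's test sets.

```python
def tag_of(term):
    """Classify a term by tag; the Python encodings are assumptions."""
    if term is None:
        return "variable"    # None stands for an unbound variable
    if isinstance(term, list):
        return "list"        # a non-empty list cell
    if isinstance(term, tuple):
        return "structure"   # (functor, arg1, ..., argN)
    return "constant"        # atoms and numbers, including the atom "nil"


def select_clause(first_arg, branches):
    """Exactly one tag test succeeds, so one table lookup picks the branch."""
    return branches[tag_of(first_arg)]
```

Because the four tests cannot succeed at the same time, the selection needs no choice point: a branch table such as {"variable": "V1", "constant": "C1", "list": "C2", "structure": "fail"} reproduces the dispatch used for append/3.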

3.3. Specialize unification

The WAM unification instructions (get and unify) are complex. They operate in two modes
(read mode and write mode) depending on the type of the object being unified, they dereference their
arguments, and they trail variable bindings. It is better to compile unification directly into simpler instructions.

In the Aquarius compiler, unification is compiled into the simplest possible BAM code taking the
type information into account (Chapter 5). Often it is possible to reduce a unification to a single load or
store. The use of uninitialized variables (see below) to simplify variable binding greatly improves the
generated code.


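[Figure 2.9: diagram of registers and memory cells for each category of unbound variable; the legend
distinguishes cells whose value is ignored from cells whose value is important.]

Figure 2.9 - Three categories of unbound variables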


3.3.1. Simplifying variable binding


A  major  source  of  inefficiency  in  WAM  implementations  is  that  logical  variables  are  often  created  as 
unbound  (i.e.  as  self-referential  pointers)  and  then  unified  soon  afterwards.  Creating  and  unifying  does 
much  unnecessary  work;  it  would  be  faster  just  to  reserve  a  memory  location  and  then  write  to  it.  The 
Aquarius  compiler  defines  such  a  representation,  called  uninitialized  variables.  Conceptually,  uninitialized 
variables are defined at two levels:

(1) At the logical level, an uninitialized variable is an unbound variable that is not aliased, i.e. there are
no other variables bound to it. The dataflow analyzer (Chapter 4) uses this definition to derive
uninitialized variable types.

(2) At the implementation level, an uninitialized variable is a location that is allocated to contain an
unbound variable, but the location is not given a value. The kernel Prolog compiler (Chapters 4, 5,
and 6) uses this definition to compile uninitialized variables efficiently.

The location containing an uninitialized variable can either be a register or a memory word, resulting in
two kinds of uninitialized variables, namely uninitialized register and uninitialized memory variables. The
first are registers whose contents are ignored. The second are pointers to memory locations whose contents
are ignored. Standard unbound variables are called initialized variables; they are pointers to locations
pointing to themselves. Figure 2.9 illustrates the three categories of unbound variables.


Table 2.3 - The cost of uninitialized variables

                                      Cost (VLSI-BAM cycles)
                            For Unification         For Backtracking
Type of variable            Creation    Binding     Trailing    Detrailing
Uninitialized Register      0           0           0           0
Uninitialized Memory        1           1           0           0
Initialized Variable        2           5           2           0 or 4

The dataflow analyzer derives both uninitialized register and uninitialized memory types. It is often
able to determine that an argument is uninitialized; for a representative set of programs it finds that 23% of
all predicate arguments are uninitialized. Of these, two thirds have uninitialized memory type and one
third have uninitialized register type.


Table 2.3 gives the minimum run-time costs on the VLSI-BAM processor for the three categories of
unbound variables. Costs are given for unification support (creation and binding) and for backtracking
support (trailing and detrailing). Binding an initialized variable is expensive because the variable must be
dereferenced before the new value can be stored in the memory cell. Binding an uninitialized memory
variable reduces to a single memory store operation. Binding an uninitialized register variable is free if it
is created in the register that needs it. The cost of detrailing (restoring a variable to an unbound state on
backtracking) is zero for uninitialized variables. For initialized variables it depends strongly on the
effectiveness of the compiler in generating deterministic code. It is 0 cycles if the variable does not have to be
unbound on backtracking, and 4 cycles otherwise.
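
As a rough worked example, the figures of Table 2.3 can be added up in Python. The sketch assumes that a variable is created once, bound once, and trailed once at the minimum cost listed; the dictionary layout and the function name are invented for the illustration.

```python
# Minimum costs in VLSI-BAM cycles, copied from Table 2.3.
COST = {
    "uninitialized register": {"creation": 0, "binding": 0, "trailing": 0},
    "uninitialized memory": {"creation": 1, "binding": 1, "trailing": 0},
    "initialized": {"creation": 2, "binding": 5, "trailing": 2},
}

# Detrailing is paid only by an initialized variable that must be
# restored to the unbound state on backtracking.
DETRAIL_INITIALIZED = 4


def lifetime_cost(kind, detrailed=False):
    """Sum creation, binding, and trailing; add detrailing if it applies."""
    total = sum(COST[kind].values())
    if kind == "initialized" and detrailed:
        total += DETRAIL_INITIALIZED
    return total
```

Under these assumptions an initialized variable costs 9 cycles, or 13 if it is detrailed, against 2 cycles for an uninitialized memory variable and none for an uninitialized register variable.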

3.4. Dataflow analysis

The Aquarius compiler implements a dataflow analyzer that is based on abstract interpretation. It
translates the program to one in which predicate arguments range over a finite set of values. Each of the
values corresponds to an infinite set of values (i.e. a type) in the original program. The analyzer derives a
small set of types: uninitialized, ground (the argument contains no unbound variables), nonvariable (the
argument is not an unbound variable) and recursively dereferenced (the argument is dereferenced, i.e. it is
accessible without pointer chasing, and if it is compound, then all its arguments are recursively
dereferenced). These types have been chosen carefully to be useful during compilation.
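
To make the type definitions concrete, the following Python sketch tests whether a given run-time term belongs to each type. It is not the analyzer, which works on abstract values at compile time; the Var class, the tuple encoding of compound terms, and the choice to treat an unbound variable as dereferenced are all assumptions of the example.

```python
class Var:
    """An unbound variable or, once `binding` is set, a pointer to a term."""
    def __init__(self, binding=None):
        self.binding = binding


def deref(term):
    """Follow bound variables until a non-variable or unbound variable."""
    while isinstance(term, Var) and term.binding is not None:
        term = term.binding
    return term


def is_nonvariable(term):
    return not isinstance(deref(term), Var)


def is_ground(term):
    term = deref(term)
    if isinstance(term, Var):
        return False
    if isinstance(term, tuple):  # compound term: (functor, arg1, ...)
        return all(is_ground(arg) for arg in term[1:])
    return True


def is_recursively_dereferenced(term):
    if isinstance(term, Var):
        return term.binding is None  # a bound Var is a pointer to chase
    if isinstance(term, tuple):
        return all(is_recursively_dereferenced(arg) for arg in term[1:])
    return True
```

A term such as f(a, X) with X bound to b is ground and nonvariable but not recursively dereferenced, because reaching b requires following a pointer.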

Dataflow  analysis  by  itself  is  not  enough.  The  rest  of  the  system  must  be  able  to  use  the  information 
derived  by  the  analysis.  The  techniques  to  exploit  determinism  and  specialize  unification  in  the  Aquarius 
compiler have been developed in tandem with the analyzer for this purpose. In addition, the fine
instruction granularity of the BAM is designed to support these optimizations.

4.  Related  work 

First  a  survey  is  given  of  work  that  is  related  to  the  four  principles  of  the  Aquarius  compiler.  Then 
an  overview  is  given  of  Prolog  implementations  that  are  interesting  in  some  way. 


4.1.  Reduce  instruction  granularity 


Tamura et al. [39, 65] have done fundamental work at IBM Japan in reducing the grain size of
compiled operations for Prolog. Their compilation is done in three steps. The first step is to compile Prolog
into WAM. In the second step the intermediate code is translated into a directed graph. Each WAM
instruction becomes a subgraph containing simple operations such as case selection on tags, jumps,
assignments, and dereferencing. The graph is optimized through rewrite rules. Case selections based on a tag
value, never-selected cases, redundant tests, case statements with only one branch, and unreachable
instructions are eliminated. Known values are propagated. These rewrites are applied several times and
the resulting graph is then translated back into intermediate code. In the third step the intermediate code is
translated into a PL.8 program which is sent to a high-quality PL.8 optimizing compiler [3]. Performance
results are given for a few small programs and are quite good. There are several problems in their
approach. They still use the WAM as an intermediate language, and compiling is prohibitively slow
because their system is experimental. Without compile-time hints their performance drops significantly.

4.2. Exploit determinism

Significant improvements over the WAM are possible to avoid choice point creation in deterministic
predicates. The WAM indexes on only the first argument and saves all registers in choice points. Turk
[72] describes several optimizations that reduce the time necessary to restore machine state when
backtracking. In [74], I describe a compilation scheme that attempts to take advantage of the fact that most
Prolog predicates are deterministic. Choice point creation and moves to and from choice points are
minimized. Clauses are compiled with multiple entry points and predicates are compiled as decision trees. The
techniques used in the Aquarius system are inspired by this work. Carlsson [15] measures the performance
improvement of a scheme for creating choice points in two parts, saving only a small part of the machine
state first, and postponing saving the remainder until later in the clause when it can be determined that the
head unification and any simple tests have succeeded. Implemented in the SICStus Prolog system, this
reduces execution time by 7-15% on four large programs.


Recently there have appeared several commercial Prolog-like languages (Trilogy and Turbo Prolog)
that generate efficient code for programs annotated with type and determinism declarations. In this regard
Trilogy [79] is noteworthy because it gives a logical semantics to programs written in a Pascal-like
notation. Typed predicates that are annotated as being deterministic are compiled into efficient native code.
The achievement of Trilogy is reassuring; since many predicates in standard Prolog are intended to be
executed in a deterministic way, with some analysis it should be possible to obtain the same efficiency for
standard Prolog.

Several systems have generalized the first argument indexing of the WAM. BIM_Prolog [4] can
index on any argument when given appropriate declarations. SEPIA [29] incorporates heuristics to decide
which predicate arguments are important for deterministic selection. It uses the first "indexable"
argument of a predicate. If there are several possibilities it first uses the argument where it is more likely that
fewer clauses will be selected.

Several papers describe fast implementations of the cut operation. Bowen et al. [9] implement cut by
adding a register that holds the address of the most recent choice point before entering the predicate. This
register is updated by each call and execute instruction. Cut is implemented by moving this
register to the WAM's choice point register r(b). Marien and Demoen [46] implement cut in a similar
fashion. These schemes suffer from having to do an additional register move for each procedure call,
unless a different call instruction is used for predicates with and without cut. The scheme implemented in
the Aquarius compiler does not slow down procedure calls and does not need an additional register.
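
The register-based cut described above, as attributed to Bowen et al., can be illustrated with a toy Python model. Here choice point addresses are represented by stack heights, and the Machine class and the name cut_b for the additional register are assumptions of the sketch. It does not model the scheme used in the Aquarius compiler.

```python
class Machine:
    """Toy model of cut using an extra register (assumed names)."""

    def __init__(self):
        self.choice_points = []  # stack of retry labels
        self.b = 0               # r(b): current height of the stack
        self.cut_b = 0           # extra register: r(b) at predicate entry

    def call(self):
        # The additional register move paid on every procedure call.
        self.cut_b = self.b

    def try_me_else(self, label):
        self.choice_points.append(label)
        self.b = len(self.choice_points)

    def cut(self):
        # Discard every choice point created since the predicate was entered.
        self.b = self.cut_b
        del self.choice_points[self.b:]
```

The extra assignment in call is the overhead the text refers to: it is executed for every call, whether or not the called predicate contains a cut.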

4.3. Specialize unification

Significant improvements over the WAM are possible for unification. Turk [72] describes several
optimizations related to compilation of unification, to reduce the overhead of explicitly maintaining a
read/write mode bit and remove some superfluous dereferencing and tag checking. Marien [44] describes a
method to compile write mode unification that uses a minimal number of memory operations and avoids all
superfluous dereferencing and tag checking. In [75], I build on this work by introducing a simplified
notation and extending it for read mode unification, but my scheme suffers from a large code size expansion.
The Aquarius system modifies this technique to limit the code size expansion at a slight execution time
cost. Meier [48] has developed a technique that generalizes Marien's idea for both read and write mode
and achieves a linear code size, also with a slight execution time cost. This technique is implemented in
the SEPIA system [29].

Beer [5] has suggested the use of a simplified representation of Prolog variables for which binding is
much faster. He introduces several new tags for this representation, which he calls uninitialized variables,
and keeps track of them at run-time. He shows that both dereferencing and trailing are reduced
significantly. This idea was a strong influence on the Aquarius compiler. At the Prolog level, logical
semantics are preserved, but at the code level there is now a coherent integrated use of destructive
assignment for values that fit in a register. My scheme is different from Beer's: it uses the same tag for both
uninitialized and standard Prolog variables. The analyzer finds uninitialized variables at compile-time and
the compiler determines when it is safe to use destructive assignment to bind them.

4.4.  Dataflow  analysis 

R. Warren et al. [84] have done the most comprehensive work measuring the practicality of global
dataflow analysis in logic programming. Their paper describes two dataflow analyzers: (1) MA3, the MCC
And-parallel Analyzer and Annotator, and (2) Ms, an experimental analysis scheme developed for
SB-Prolog. MA3 derives aliasing and ground types and keeps track of the structure of compound terms, while
Ms derives ground and nonvariable types. The paper concludes that both dataflow analyzers are effective
in deriving types and do not increase compilation time by too much. My dataflow analyzer differs from
both MA3 and Ms in three ways. First, the analyzer works over a different domain. Second, it avoids
problems with aliased variables by deriving only limited type information for them. Third, it is integrated
into a compiler which has been developed to take full advantage of the types it derives.

For correctness, it is imperative to consider the effects of variable aliasing on dataflow analysis.
Aliasing occurs when two variables are bound to terms that have variables in common. Finding accurate
aliasing information is an important topic in current research [18,36]. However, aliasing complicates the
implementation of dataflow analysis. My analyzer considers only unaliased variables as candidates for
unbound variable types. Measurements of the analyzer show that unaliased variables occur often enough
to make the analysis worthwhile. This conservative treatment of aliasing simplifies the implementation,
since it is not necessary to explicitly represent and propagate aliasing information. Of course, it also
reduces the effectiveness of the analysis. Thus aliasing needs to be studied further.

Marien et al. [45] have performed an interesting experiment in which several small Prolog predicates
(recursive list operations) were hand-compiled with several levels of optimization based on information
derivable from a dataflow analysis. The analysis was done by hand at four levels: The first level derives
unbound variable and ground modes. The second level also derives recursively defined types. The third
level also derives lengths of dereference chains (pointer chains that must be followed at run-time). The
fourth level also derives liveness information for compound data structures and is used to determine when
they are last used so that their memory may be recovered (compile-time garbage collection). Execution
time measurements show that each analysis level improves speed over the previous level. This experiment
shows that a simple analysis can achieve good results on small programs.

4.5.  Other  implementations 

This section gives an overview of interesting Prolog implementations that are related to this
dissertation in some way. Most existing implementations of Prolog, both on general-purpose and special-purpose
machines, are based on the Warren Abstract Machine (WAM) or are derived from it. The general-purpose
and special-purpose approaches are presented separately. The first subsection describes some important
software implementations and their ideas. The second subsection summarizes some important
architectures and their innovations.

4.5.1.  Implementing  Prolog  on  general-purpose  machines 

As far as I know, the earliest WAM compiler was my PLM compiler, completed and published in
August 1984 [73].† The compiler was interesting as it was itself written in Prolog, unlike many later Prolog
compilers. The first commercial implementation of the WAM was Quintus Prolog, announced in
November 1984.


† The PLM compiler is still available from us, but is now obsolete and not recommended for current research work. Our
research group expects to release soon a complete Prolog system based on the Aquarius compiler.

Among the highest performance commercial implementations available today are IBM Prolog,
Quintus Prolog [58], BIM_Prolog [4], and ALS Prolog [2]. There are three significant implementations of
Prolog available today that were developed at research institutions: SICStus Prolog [63], SEPIA [29], and
SB-Prolog [83]. All of these systems are based on extensions of the WAM (except possibly IBM Prolog,
of which I have little information) and compile to WAM-like instructions which are either emulated on the
target machine or macro-expanded to native code. Some of these systems (e.g. SB-Prolog and IBM
Prolog) are able to compile special cases of deterministic programs into efficient code.

4.5.1.1. Taylor's system

Independently of this research, Andrew Taylor is implementing a high performance Prolog compiler
for the MIPS processor [67]. The compiler includes a dataflow analyzer that explicitly represents type,
aliasing, dereference chain lengths, and trailing information [66]. His preliminary results indicate that it is
of comparable performance to the compiler presented in this dissertation. Running a set of small
benchmark programs on the MIPS R2030 processor, the system is 24 times faster than compiled SICStus Prolog
version 0.6 and the code size is similar to that of the KCM.

4.5.1.2.  IBM  Prolog 

IBM Prolog accepts mode declarations, implements more general indexing than the WAM, does a
limited global analysis (however, it does not derive any types), and generates high performance native
code. It is able to compile some kinds of deterministic programs with conditional branches.

4.5.1.3. SICStus Prolog

SICStus Prolog was developed at the Swedish Institute of Computer Science in Stockholm. A back-end
module was written for it by Mats Carlsson which generates native code avoiding the superfluous
memory references of a naive WAM translation [14,44]. It is comparable in performance to Quintus Prolog
when no built-in predicates are used.
4.5.1.4. SB-Prolog

SB-Prolog was developed at SUNY in Stony Brook. It recognizes a special case of the general techniques
for extracting determinism discussed in this dissertation: it recognizes when arithmetic tests that are
each other's opposites appear, and compiles a conditional branch. It also incorporates a simple partial
evaluator which is used for macro expansion, and a simple dataflow analysis scheme has recently been
developed for it [84].

4.5.2.  Implementing  Prolog  on  special-purpose  machines 

In the past, because the WAM was regarded as the best way to implement Prolog, the performance
gap between special-purpose architectures and general-purpose architectures was large. Much of the effort
in high performance Prolog implementation was put into architecture design, and in particular in hardware
support for the WAM instructions. This dissertation shows that a better understanding of Prolog execution
narrows the performance gap. The implications of this development for the future of special-purpose
architectures are discussed in the VLSI-BAM paper [34] and summarized in this section.

4.5.2.1. PLM

The first special-purpose Prolog architecture that was built is the PLM (Programmed Logic
Machine), due to Dobry et al [26-28]. Its design was inspired by a proposal of Tick & Warren [68]. The
PLM implements the WAM in microcode with a 100 ns clock cycle. It was built on wire-wrap boards and
ran a few small programs in 1983. Spin-offs of this project included the VLSI-PLM single-chip implementation
[60] and the Xenologic X-1, a commercial coprocessor for Sun workstations.

Several papers have compared the number of cycles needed by the PLM to that of general-purpose
architectures. These ratios are valid measurements of the effect of the PLM's architectural support for
WAM implementation. Mulder & Tick [51] and Patt & Chen [54] have compared the performance of the
PLM [28], a microcoded implementation of the WAM, to a macro-expanded WAM on the MC68020 processor.
They find that the MC68020 needs 3 to 4 times as many cycles as the PLM to execute the
WAM. Patt and Chen find that static code size on the MC68020 is about 20 times that of the PLM.


4.5.2.2.  SPUR 


Borriello et al [8] have implemented a macro-expanded WAM on the SPUR processor (Symbolic
Processing Using RISCs). They find that the SPUR takes about 2.0 times as many cycles as the PLM
and that static code size is about 14 times that of the PLM. These numbers include local optimizations
implemented by Chen and Nguyen [20] that improve the original numbers by about 10%.

4.5.2.3. PSI-II and PIM/p

In the context of the FGCS (Fifth Generation Computer System) project, researchers of ICOT (the
Japanese Institute for New Generation Computer Technology) have designed and built several sequential
and parallel architectures for logic programming [64,71]. One of the more interesting sequential machines
is the PSI-II (Personal Sequential Inference machine II) [52], a microcoded implementation of the WAM
which executes at speeds similar to the PLM. The processing elements of the PIM/p (Parallel Inference
Machine) architecture are currently the highest performance sequential logic machines at ICOT. They
execute at two to three times the speed of the PLM.

4.5.2.4.  KCM 

Benker et al [6] describe a special-purpose Prolog machine, the KCM (Knowledge Crunching
Machine), which is based on an extended WAM. Its instruction set consists of two parts: a general-purpose
instruction set, and a microcoded Prolog-specific instruction set. It has a cycle time of 80 ns and executes
in about 1/3 the number of cycles of the PLM. Its code size is about three times greater. The KCM project
was done together with the development of a Prolog system and environment called SEPIA (see previous
section). About 60 KCM machines were constructed and delivered to the ECRC member companies.

4.5.2.5. VLSI-BAM

Holmer et al [34] describe a single-chip microprocessor with extensions for Prolog, the VLSI-BAM
(VLSI Berkeley Abstract Machine). It is a pipelined load-store processor with a cycle time of 33 ns. It
takes about 1/3 the number of cycles to run programs as the PLM and its code size is about three times
greater, results similar to the KCM. However, they are achieved largely through the effort of the compiler.
The goal of the BAM project is to find the minimal extensions to a general-purpose architecture to support
a high performance Prolog implementation. The rationale for the VLSI-BAM architecture is that existing
general-purpose architectures are designed to execute imperative languages like C and do not have
adequate support for Prolog. The compiler described in this dissertation was developed simultaneously with
the architecture, and interaction between the two designs has significantly improved both.
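The cycle times and cycle-count ratios quoted in this section can be combined into a rough estimate of relative execution time. The sketch below is only a back-of-the-envelope reading of those figures: it takes the PLM cycle as 100 ns, treats "about 1/3 the number of cycles" as exactly 1/3, and ignores memory system effects and benchmark differences. The predicate names machine/3, relative_time/2, and speedup_over_plm/2 are illustrative and do not come from the dissertation.

```prolog
% machine(Name, CycleTimeNs, CyclesRelativeToPLM)
% Figures as quoted in the text; the ratios are approximate.
machine(plm,      100.0, 1.0).
machine(kcm,       80.0, 1.0/3.0).
machine(vlsi_bam,  33.0, 1.0/3.0).

% relative_time(+Name, -T): nanoseconds needed for the work that
% the PLM does in one of its cycles.
relative_time(Name, T) :-
    machine(Name, Ns, Ratio),
    T is Ns * Ratio.

% speedup_over_plm(+Name, -S): estimated speedup relative to the PLM.
speedup_over_plm(Name, S) :-
    relative_time(plm, T0),
    relative_time(Name, T),
    S is T0 / T.
```

Under these assumptions, the query speedup_over_plm(kcm, S) gives a factor of about 3.75 and speedup_over_plm(vlsi_bam, S) gives a factor of about 9, which shows how much of the difference between the KCM and the VLSI-BAM comes from cycle time rather than cycle count.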

The BAM project has determined that a small amount of architectural support (5% increase in chip
area) gives a large performance boost (50% performance increase) for programs that use Prolog-specific
features. The support does not interfere with the general-purpose architecture, so it is possible for future
general-purpose machines to incorporate this support for high performance symbolic computing. The support
is designed specifically to support the logical variable, dynamic typing, unification, and backtracking.
A language that uses any of these features can benefit from it.
Chapter 3

The  Two  Representation  Languages 


1.  Introduction 

This chapter defines the two languages used by the compiler to represent programs: kernel Prolog, a
simplified form of Prolog, and the Berkeley Abstract Machine (BAM), a low-level instruction set and
execution model that is close to a standard sequential processor. Kernel Prolog is an internal language that is
not accessible to the user. BAM is the output language of the compiler.

2.  Kernel  Prolog 

The first representation language in the compiler is kernel Prolog, a simplified, canonical form of
Prolog. The syntax of kernel Prolog is given in Figure 3.1. This should be compared with the definition of
full Prolog syntax given in Chapter 2. The control flow of kernel Prolog is simpler, a set of internal
primitives is defined that are only used inside the compiler, and a case statement is defined. Kernel Prolog does
not have nested disjunctions, if-then-else, cut, negation, or arithmetic expressions. Each predicate is
represented as a single term (H:-D) containing a head H with distinct variable arguments and a body D
that is a single disjunction (an OR choice). Each alternative of the disjunction is a conjunction, i.e. an
AND sequence of goals. Unifications in the head of the original predicate are represented as explicit
unifications in the arms of the disjunction. Disjunctions, negations, and if-then-else forms in the original
predicate are converted into dummy predicates. Cut and arithmetic expressions are converted into simpler
internal built-in predicates.

For example, the predicate:

a(b).
a(X) :- ( 0 is X mod 2 -> e(X) ; f(X) ).

is represented as follows in kernel Prolog:
predicate((H:-D)) :- head(H), disjunction(D).

head(H) :- goal_term(H).

disjunction(fail).
disjunction((C;D)) :- conjunction(C), disjunction(D).

conjunction(true).
conjunction((G,C)) :- goal(G), conjunction(C).

goal(G) :- case_goal(G).
goal(G) :- internal_goal(G).
goal(G) :- external_goal(G).

case_goal('$case'(Name,Ident,CB)) :- test_set(Name, Ident), case_body(CB).

case_body('$else'(D)) :- disjunction(D).
case_body(('$test'(T,D);CB)) :- test(T), disjunction(D), case_body(CB).

external_goal(G) :- goal_term(G), \+case_goal(G), \+internal_goal(G).

term(T) :- var(T).
term(T) :- goal_term(T).

goal_term(T) :- nonvar(T), functor(T, _, A), term_args(1, A, T).

term_args(I, A, _) :- I>A.
term_args(I, A, T) :- I=<A, arg(I, T, X), term(X), I1 is I+1, term_args(I1, A, T).

% Predicates defined in tables:
internal_goal(G) :- (Defined in Table 3.1).
test_set(Name, Ident) :- (Defined in Table 4.11).
test(T) :- (Defined in Table 4.11).

% Built-in predicates needed in the definition:
functor(T, F, A) :- (Term T has functor F and arity A).
arg(I, T, X) :- (Argument I of compound term T is X).
var(T) :- (Argument T is an unbound variable).
nonvar(T) :- (Argument T is a nonvariable).


Figure 3.1 - Syntax of kernel Prolog

a(X) :- ( X=b, true
        ; '$d'(X), true
        ; fail
        ).

'$d'(X) :- ( '$cut_load'(Z), '$d2'(X, Z), true
           ; fail
           ).

'$d2'(X, Z) :- ( '$mod'(X,2,0), '$cut'(Z), e(X), true
               ; f(X), true
               ; fail
               ).

All predicates that start with the character '$' are created internally. Cut is implemented with the two
built-ins '$cut_load'(X) and '$cut'(X). The arithmetic expression 0 is X mod 2 is
replaced by a call to an explicit arithmetic built-in '$mod'(X,2,0). The if-then-else is replaced by a
call to the dummy predicate '$d'(X). All dummy predicates are given unique names.

Kernel  Prolog  has  many  advantages  over  standard  Prolog.  The  scope  of  variables  is  not  limited  to  a 
single  clause,  but  is  extended  over  the  whole  predicate.  Many  optimizations  are  easier  to  do — for  example, 
dataflow  analysis  and  determinism  extraction.  Compilation  to  BAM  code  and  register  allocation  are 
simplified. 

The  following  two  sections  describe  the  internal  predicates  of  kernel  Prolog  and  how  standard  Prolog 
is  converted  to  kernel  Prolog. 

2.1.  Internal  predicates  of  kernel  Prolog 

The kernel Prolog form of a program contains predicates that are not part of standard Prolog and that are
invisible to the user. The internal predicates always begin with the character '$'. They are of three
kinds:

(1) Internal built-in predicates (Table 3.1). These are classified into three categories depending on
their use: (1) implementation of cut, (2) type checking, and (3) arithmetic. They are expanded into
BAM instructions before being output, so the user never sees them.

(2) A case statement. This control structure is designed to express deterministic selection in Prolog.
Chapter 4 describes how the case statement is created. It is translated directly into conditional


Table 3.1 - Internal built-ins of kernel Prolog

Built-in                  Description
'$cut_load'(X)            Load the choice point register r(b) into X.
'$cut'(X)                 Make the choice point pointed to by X the new top of the
                          choice point stack.
'$name_arity'(X,Na,Ar)    Test that X has functor Na and arity Ar. This only does a
                          check; it never binds X.
'$test'(X,T)              General type-checking predicate that tests whether the type of
                          X is in the set T, where T is a subset of {unbound variable, nil,
                          non-nil atom, negative integer, nonnegative integer, float, cons,
                          structure}.
'$equal'(X,Y)             Test that X and Y are identical simple terms.
'$add'(S1,S2,D)           Integer addition D <- S1+S2.
'$sub'(S1,S2,D)           Integer subtraction D <- S1-S2.
'$mul'(S1,S2,D)           Integer multiplication D <- S1*S2.
'$div'(S1,S2,D)           Integer division D <- S1/S2.
'$mod'(S1,S2,D)           Integer remainder D <- S1 mod S2.
'$and'(S1,S2,D)           Bitwise integer "and" D <- S1 ∧ S2.
'$or'(S1,S2,D)            Bitwise integer "or" D <- S1 ∨ S2.
'$xor'(S1,S2,D)           Bitwise integer exclusive-or D <- S1 ⊕ S2.
'$sll'(S1,S2,D)           Logical left shift D <- S1<<S2.
'$sra'(S1,S2,D)           Arithmetic right shift D <- S1>>S2.
'$not'(S,D)               Bitwise integer negation D <- not S.

branches in the BAM code and has the following syntax:

'$case'(Name,Ident,CaseBody)

where:

CaseBody = ( '$test'(Test,Code)
           ; '$else'(Code)
           )

CaseBody is a disjunction of '$test' goals, terminated with an '$else' goal. Code is
any valid kernel Prolog disjunction. Name and Ident identify the test set, and Test is a Prolog
predicate (Table 4.11). Test is the test that is valid along the branch. For example, for the
hashing function it will be the goal X=a where a is the atom or structure used in that direction.

(3) "Dummy" predicates. Kernel Prolog does not allow control structures (i.e. disjunctions, if-then-else,
and negation) in clauses, but only calls. The control structures are transformed into calls to
dummy predicates, which are predicates that exist only inside the original predicate. Dummy predicates
are created with unique names that are derived from the predicate they are contained in.


2.2. Converting standard Prolog to kernel Prolog


The first stage of compilation is a sequence of five source transformations that converts raw input
clauses into kernel Prolog. An input predicate in standard Prolog is transformed into a tree that contains a
kernel Prolog form of the original predicate and a set of dummy predicates in kernel form created during
the transformation. Care is taken to put the predicate in a form that maximizes opportunities for determinism
extraction. The five transformations are:

(1) Standard form transformation. Convert the raw Prolog input to a convenient standard notation.
This does several housekeeping tasks: it properly terminates conjunctions (with true) and disjunctions
(with fail), and it converts negation-as-failure into if-then-else.

(2) Head unraveling. Rewrite the head of each clause as a new head and a list of unification goals such
that all the arguments of the new head are distinct variables and the head unifications are unification
goals.

(3) Arithmetic transformation. Compile arithmetic expressions to internal arithmetic built-ins.

(4) Cut transformation. Implement cut by converting all uses of cut and if-then-else to internal cut
built-ins.

(5) Flattening. At this point all complex control has been converted to disjunctions. Convert nested
disjunctions to dummy predicates.

2.2.1. Standard form transformation

The standard form of a clause is intended to simplify its syntax so that traversing it is as simple as
possible. The standard form satisfies the rules in Table 3.2. These rules are ignored in the presentation of
most of the examples in this dissertation because they make the examples less readable (although they are
always satisfied in the compiler).
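As an illustration of the first two standard-form rules (right-associative conjunctions with no internal true, terminated by true), the following is a minimal sketch. It is not the compiler's actual code: the name std_conj and its accumulator argument are assumptions made for this example, and disjunctions, negation, and if-then-else (rules 3 to 8) are not handled.

```prolog
% std_conj(+Goal, -Std): Std is Goal rewritten as a right-associative
% conjunction with no internal true, terminated by true.
std_conj(Goal, Std) :-
    std_conj(Goal, true, Std).

% std_conj(+Goal, +Tail, -Std): put the goals of Goal in front of Tail.
std_conj(G, Tail, Std) :-
    var(G), !,                      % a variable goal is kept as a call
    Std = (G, Tail).
std_conj((A, B), Tail, Std) :- !,
    std_conj(B, Tail, Mid),         % normalize the right part first
    std_conj(A, Mid, Std).          % then prepend the left part
std_conj(true, Tail, Std) :- !,
    Std = Tail.                     % drop internal occurrences of true
std_conj(G, Tail, (G, Tail)).
```

For example, std_conj(((a,true),(b,c)), S) yields S = (a,b,c,true), and a single goal a becomes (a,true).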

2.2.2.  Head  unraveling 

Unraveling the head of a clause consists of rewriting it as a new head and putting a series of
unification goals in the clause's body so that all the head's arguments are distinct variables and all the head
Table 3.2 - Standard form of a clause

Rule  Description
1     Conjunctions and disjunctions are right associative.
2     Conjunctions have no internal true and are terminated by true.
3     Disjunctions have no internal fail and are terminated by fail.
4     Single goals inside disjunctions are considered as conjunctions (and therefore rule 2 applies).
5     There is no negation (it is converted to if-then-else).
6     Arguments of if-then-else are considered as conjunctions (and therefore rule 2 applies).
7     (A->B) as a goal in a conjunction is converted to (A->B;fail).
8     The first argument of all unify goals is a variable.

unifications are unification goals in the body.


If this is not done correctly then much opportunity for later optimization is lost. From the predicate's
type formula, the compiler knows which head arguments are nonvariable and which head arguments are
unbound. Unification goals are created that satisfy two constraints:

(1) Maximize the number of nonvariable arguments that are unified together. Put these unifications first
in the unraveled clause.

(2) Minimize the number of unification goals that contain unbound variables. Put these unifications last
in the unraveled clause.

For example, consider the clause:

:- mode((a(A,B,C) :- nonvar(A), nonvar(B), var(C))).
a(A,A,A) :- atomic(A), ...

The type declaration says that the first two arguments are nonvariables and the third argument is an
unbound variable. The argument A appears three times in the head. Therefore there are three ways to
unravel this clause: (a(X,Y,Z):-X=Y,X=Z), (a(X,Y,Z):-Y=X,Y=Z), and (a(X,Y,Z):-Z=X,Z=Y).
Considering the mode declaration, the head is transformed into the first of the three unraveled
versions:

a(A,B,C) :- A=B, A=C, atomic(A), ...

The first unification A=B is of two nonvariables. The second unification A=C is of a nonvariable and an
unbound variable. This satisfies both constraints.
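The basic mechanics of head unraveling can be sketched as follows. This is a simplified illustration under stated assumptions, not the compiler's algorithm: it keeps the first occurrence of each head variable, replaces every other argument by a fresh variable with an explicit unification, and does not use the mode declaration to choose among the possible unravelings or to order the unifications. The names unravel/2, unravel_args/4, seen/2, and append_goals/3 are hypothetical.

```prolog
% unravel(+Clause, -Unraveled): rewrite (Head :- Body) so that the new head
% has distinct variables as arguments and the head unifications become
% explicit goals at the front of the body.
unravel((Head :- Body), (NewHead :- NewBody)) :-
    Head =.. [Name|Args],
    unravel_args(Args, [], NewArgs, Unifs),
    NewHead =.. [Name|NewArgs],
    append_goals(Unifs, Body, NewBody).

% unravel_args(+Args, +Seen, -NewArgs, -Unifs)
% Seen holds the variables already used as arguments of the new head.
unravel_args([], _, [], []).
unravel_args([A|As], Seen, [A|Ns], Us) :-
    var(A), \+ seen(A, Seen), !,        % first occurrence of a variable: keep it
    unravel_args(As, [A|Seen], Ns, Us).
unravel_args([A|As], Seen, [V|Ns], [V=A|Us]) :-
    unravel_args(As, [V|Seen], Ns, Us). % otherwise: fresh variable V and goal V=A

% seen(+X, +Vars): X is identical (==) to some element of Vars.
seen(X, [Y|Ys]) :- ( X == Y -> true ; seen(X, Ys) ).

% append_goals(+Unifs, +Body, -NewBody): put the unification goals before Body.
append_goals([], Body, Body).
append_goals([U|Us], Body, (U, Rest)) :- append_goals(Us, Body, Rest).
```

Applied to (a(A,A,A) :- atomic(A)), the sketch produces a head a(A,Y,Z) with body (Y=A, Z=A, atomic(A)). Each generated unification has a variable as its first argument, in line with rule 8 of Table 3.2, but selecting the unraveling that best fits the mode declaration is left out.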
expression((X is Expr), Code) :- expr(Expr, X, Code, []).

expr(V, V) --> {var(V)}, !.
expr(A, A) --> {integer(A)}, !.
expr(A+B, C) --> expr(A, Ta), expr(B, Tb), ['$add'(Ta,Tb,C)].
expr(A-B, C) --> expr(A, Ta), expr(B, Tb), ['$sub'(Ta,Tb,C)].
expr(A*B, C) --> expr(A, Ta), expr(B, Tb), ['$mul'(Ta,Tb,C)].
expr(A/B, C) --> expr(A, Ta), expr(B, Tb), ['$div'(Ta,Tb,C)].


Figure 3.2 - Compiling an arithmetic expression


2.2.3. Arithmetic transformation

The is/2 predicate is translated into internal three-argument arithmetic built-ins (Table 3.1).
Figure 3.2 gives a simplified but fully functional version of the algorithm used to compile expressions. It
handles arbitrary expressions containing the four basic arithmetic operations. For example, the call:

expression(X is 23*(Y+Z), Code)

gives the code:

Code = ['$add'(Y,Z,T), '$mul'(23,T,X)]

The full algorithm handles all the arithmetic primitives of Table 3.1 and does partial constant folding.
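The text does not say how the full algorithm folds constants. The sketch below shows one plausible reading, offered only as an illustration: an operation whose two operands are both integers at compile time is evaluated immediately instead of being emitted. The names compile_is/2, fold_expr//2, emit//5, and builtin/2 are assumptions, only +, -, and * are covered, and the final X = T goal for a fully folded expression is a choice made for this example.

```prolog
:- use_module(library(lists)).   % append/3

% compile_is(+(X is Expr), -Code): Code is a list of kernel-style goals.
compile_is((X is Expr), Code) :-
    phrase(fold_expr(Expr, T), Code0),
    (   Code0 = [_|_], var(T)
    ->  X = T, Code = Code0            % T is the result of the last goal emitted
    ;   append(Code0, [X = T], Code)   % whole expression folded, or a bare variable
    ).

fold_expr(V, V) --> {var(V)}, !.
fold_expr(A, A) --> {integer(A)}, !.
fold_expr(E, C) -->
    { E =.. [Op, A, B], builtin(Op, Name) },
    fold_expr(A, Ta),
    fold_expr(B, Tb),
    emit(Op, Name, Ta, Tb, C).

% builtin(Operator, InternalBuiltIn): names taken from Table 3.1.
builtin(+, '$add').
builtin(-, '$sub').
builtin(*, '$mul').

% emit//5: fold when both operands are integers, otherwise emit a goal.
emit(Op, _, Ta, Tb, C) -->
    { integer(Ta), integer(Tb) }, !,
    { E =.. [Op, Ta, Tb], C is E }.
emit(_, Name, Ta, Tb, C) -->
    { G =.. [Name, Ta, Tb, C] },
    [G].
```

With this sketch, compile_is((X is Y*(3+4)), Code) gives Code = ['$mul'(Y,7,X)], while an expression with no constant subexpressions is compiled exactly as in Figure 3.2.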

2.2.4. Cut transformation

The cut operation modifies control flow by removing all choice points created since entering the
predicate containing the cut, including the choice point of the predicate itself. Cut is implemented by
means of a source transformation. It requires no support from the architecture except the ability to access
and modify the register r(b), which points to the most recent choice point.

The cut transformation is given in Figure 3.3. A call to the built-in '$cut_load'(X) is put at
the entry of a predicate containing a cut. This built-in moves the r(b) register to X, which marks the top
of the choice point stack on entry to the predicate. The argument X is passed to the predicate's body. Each
occurrence of cut in the body is replaced by a call to the built-in '$cut'(X). This built-in loads r(b)


procedure cut_transformation;
var P': list of clause;
begin
  for each predicate P in the program do begin
    if P contains a cut then begin
      /* At this point P = [C1, ..., Cn] (list of clauses) and Ci = (Hi :- Bi) */
      Add the argument X to all Hi in P;
      Replace each occurrence of "!" in P by '$cut'(X);
      P' := P;
      Add the predicate P' to the program;
      H := (new head with same functor and arity as all Hi);
      H' := (H with the additional argument X);
      P := [(H :- '$cut_load'(X), H')]
    end
  end
end;


Figure 3.3 - The cut transformation


from X, which restores the original top of the choice point stack. For example, consider the predicate:

p :- q, !, r.
p :- s.

This is transformed into:

p :- '$cut_load'(X), p'(X).
p'(X) :- q, '$cut'(X), r.
p'(X) :- s.

Compilation then continues in the usual manner. This method is simple and efficient. Variations of it have
been implemented in other Prolog systems [4,13,45]. This method differs from these variations in that the
compiler does not always store the value of r(b) on the environment stack, but puts it in a predicate
argument X. It is stored in an environment only if the clause is compiled with an environment.
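The cut transformation can be sketched in Prolog as follows. This is an illustrative reading of the procedure in Figure 3.3, under several assumptions: the predicate is given as a list of (Head :- Body) terms (facts written as Head :- true), if-then-else has already been removed so that only conjunctions and disjunctions need to be traversed, and the auxiliary predicate is named by appending a quote to the original name. The names cut_transform/3, rename_clauses/3, replace_cut/3, and aux_name/2 are hypothetical.

```prolog
:- use_module(library(lists)).   % append/3

% cut_transform(+Clauses, -Entry, -NewClauses)
% Entry is the new single-clause definition of the original predicate;
% NewClauses define the auxiliary predicate with one extra argument.
cut_transform(Clauses, (H :- '$cut_load'(X), H1), NewClauses) :-
    Clauses = [(Head0 :- _)|_],
    functor(Head0, Name, Arity),
    functor(H, Name, Arity),            % new head with distinct variables
    H =.. [Name|Args],
    aux_name(Name, Aux),
    append(Args, [X], Args1),
    H1 =.. [Aux|Args1],
    rename_clauses(Clauses, Aux, NewClauses).

% Add the extra argument X to each head and replace the cuts in each body.
rename_clauses([], _, []).
rename_clauses([(Hd :- B)|Cs], Aux, [(Hd1 :- B1)|Cs1]) :-
    Hd =.. [_|As],
    append(As, [X], As1),
    Hd1 =.. [Aux|As1],
    replace_cut(B, X, B1),
    rename_clauses(Cs, Aux, Cs1).

% replace_cut(+Body, +X, -NewBody): replace every ! by '$cut'(X).
replace_cut(G, _, G) :- var(G), !.
replace_cut(!, X, '$cut'(X)) :- !.
replace_cut((A, B), X, (A1, B1)) :- !,
    replace_cut(A, X, A1),
    replace_cut(B, X, B1).
replace_cut((A ; B), X, (A1 ; B1)) :- !,
    replace_cut(A, X, A1),
    replace_cut(B, X, B1).
replace_cut(G, _, G).

% aux_name(+Name, -Aux): p becomes the atom p' (an arbitrary naming choice).
aux_name(Name, Aux) :- atom_concat(Name, '''', Aux).
```

Calling cut_transform([(p :- q, !, r), (p :- s)], Entry, New) reproduces the example above: Entry calls '$cut_load'(X) followed by the auxiliary predicate, and the first auxiliary clause contains '$cut'(X) in place of the cut.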


2.2.5. Flattening


At this point, all the complex control in a predicate (disjunctions, if-then-else, and negation-as-failure)
has been translated to disjunctions. Flattening replaces the disjunctions by calls to dummy predicates.
For example, the definition:

a(X,Y) :- ( b1(X,A) ; b2(X,B), t(B) ), d(Y,A).

is transformed into:

a(X,Y) :- '$flatten_a/2_1'(X,A), d(Y,A).
'$flatten_a/2_1'(X,A) :- b1(X,A).
'$flatten_a/2_1'(X,A) :- b2(X,B), t(B).

Compilation then continues in the usual manner and the dummy predicate '$flatten_a/2_1'(X,A)
is compiled as in-line code. The dummy predicate is created with a unique name derived from the name of
the original predicate. The argument list of the dummy predicate is the intersection of the set of variables
used inside the disjunction and the set of variables used outside it. In this example the argument list is the
intersection of {X,Y,A} and {X,A,B}, which is {X,A}.
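The argument list of a dummy predicate can be computed with a few lines of Prolog. The sketch below is an illustration only: the names dummy_args/3, shared_vars/3, and member_eq/2 are assumptions, and the "outside" part of the clause is passed in as a single term containing the head and the goals that surround the disjunction.

```prolog
% dummy_args(+Inside, +Outside, -Args): Args are the variables that occur
% both in the disjunction Inside and in the rest of the clause Outside,
% in order of first occurrence in Inside.
dummy_args(Inside, Outside, Args) :-
    term_variables(Inside, In),
    term_variables(Outside, Out),
    shared_vars(In, Out, Args).

% shared_vars(+Vars, +Out, -Shared): keep the Vars that also occur in Out.
shared_vars([], _, []).
shared_vars([V|Vs], Out, Args) :-
    (   member_eq(V, Out)
    ->  Args = [V|Rest]
    ;   Args = Rest
    ),
    shared_vars(Vs, Out, Rest).

% member_eq(+X, +Vars): X is identical (==) to some element of Vars.
member_eq(X, [Y|Ys]) :- ( X == Y -> true ; member_eq(X, Ys) ).
```

For the example above, dummy_args((b1(X,A) ; b2(X,B), t(B)), (a(X,Y), d(Y,A)), Args) gives Args = [X,A]. Identity (==) rather than unification is used for the membership test so that distinct variables are never accidentally bound together.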


3. The Berkeley Abstract Machine (BAM)


The foundation of the efficiency of the compiler is its execution model, the BAM. The BAM has
been designed to support all compiler optimizations and to make the system easily retargetable to the
VLSI-BAM and general-purpose machines. The design evolved by interaction with the development of the
compiler, the architecture design of the VLSI-BAM processor, and the requirement of portability to other
architectures. The BAM was developed in tandem with the VLSI-BAM processor, but the two instruction
sets are quite different. The VLSI-BAM is constrained by its hardware implementation; the BAM evolved
by looking at the requirements of Prolog and is designed to allow a great deal of low-level optimization.

The Aquarius compiler uses a simple output language and not an existing high-level language such
as C or an existing low-level language such as an assembly for a particular machine. There are several
reasons for this:

(1) Choosing an existing language requires choosing representations for tags and data structures, and
writing frequently used Prolog-specific operations as subroutines. This is undesirable for two
reasons: First, the VLSI-BAM is one of the target machines and its architecture has a more abstract
representation for tags and Prolog-specific operations than general-purpose processors. Second,
these representations are not necessarily the best for all machines.

(2) Choosing an existing high-level language is unsatisfactory for the VLSI-BAM processor since the
only compiler for it is currently the Aquarius compiler.

(3) An unpredictable factor is introduced when doing performance evaluations. The performance on
different machines varies depending on the sophistication of the implementation of the existing
language. It is not always easy to determine the performance of the existing language from
inspection of its source code.

The syntax and semantics of the BAM are presented at several levels of detail, from a discussion of its
features in English down to a detailed formal specification of its semantics in Prolog. The body of the
dissertation defines the data types of the BAM, gives an overview of its instruction set, and justifies the
choice of instructions. Appendices B and C give formal specifications of BAM syntax and semantics, and
Appendix D gives a concise but complete English description of BAM semantics.

This section has four parts. The first part presents the data types of the BAM. The second part
summarizes the BAM instruction set. The instruction set consists of four parts: simple instructions (tagged
load-store architecture), complex instructions (Prolog-specific operations), pragmas (embedded information
to allow better translation to a real machine), and user instructions (intended to allow the complete
run-time system to be written in BAM). The third part justifies the complex instructions. The fourth part
justifies the instructions needed to implement unification by showing how they are constructed from a
unification algorithm given a few simple assumptions about the architecture.

3.1. Data types in the BAM

The data types of the BAM are classified into two groups: the types used during execution and the
types used to represent instructions (Table 3.3). The BAM has four data types that are used during
execution: words, natural numbers, symbolic labels, and mappings. These are denoted as the set of all words W,
the set of natural numbers N, the set of mappings M, and the set of symbolic labels L. A word is a pair
T^N where T is the tag and N is the value. A natural number is a nonnegative integer. A mapping (not
shown in Table 3.3) is a correspondence between a set of objects and their values (which are often words).
A symbolic label marks a position in the program.

Several definitions in Table 3.3 require some clarification. Sets are denoted by bold capital letters,
variables by capital letters, and constants by lower case letters. Addressing modes are defined recursively,
with a base case consisting of registers and atomic terms, and a recursive case consisting of three parts: tag
insertion (T^X), indirection ([X]), and offset (X+N). The BAM uses only a subset of the infinite set of
addressing modes defined here. Of all the internal registers of the BAM, only the argument registers
r(I), the heap pointer r(h), and the backtrack pointer r(b) are visible in the instruction set.
Appendix B gives a precise definition of instruction syntax including the addressing modes that are actually used.
The meaning of the instructions is defined informally in section 3.2 and formally in Appendix C.

A term can be of arbitrary size. A term that fits completely in a register is called simple. All other
terms are called compound. A register cannot store all possible terms, but it can contain encoded
information about a term. The tag of a term stored in a register is the information about the term that is
independent of the term's location in memory and can be obtained without doing a memory reference. The value
of a term in a register tells where to find the rest of the term. A register is partitioned into two fields which
contain the tag and the value of a term.


Table 3.3 - Types in the BAM

Types used during execution

Name                  Symbol  Definition
Word                  W       = { T^N | T ∈ Tp ∧ natural(N) } ∪ A
Symbolic label        L       = { fail } ∪ { F/N, l(F/N,I) | atom(F) ∧ natural(N) ∧ natural(I) }
Natural number        N
Atomic term           A       = { tatm^V | atom(V) ∨ (V=(F/N) ∧ atom(F) ∧ natural(N)) } ∪
                                { V | integer(V) } ∪ { tflt^V | float(V) }

Types used to represent instructions

Name                  Symbol  Definition
Tag                   T       = { tvar, tlst, tstr, tatm, tint, tpos, tneg, tflt } = Tp ∪ Ta
Pointer tag           Tp      = { tvar, tlst, tstr }
Atomic tag            Ta      = { tatm, tint, tpos, tneg, tflt }
Condition             C       = { eq, ne, lts, les, gts, ges }
Equality condition    Ce      = { eq, ne }
Arithmetic operation  E       = { add, sub, mul, div, mod, and, or, xor, sll, sra }
State register        Rs      = { r(b), r(h), r(e), r(hb), r(pc), r(cp), r(tmp_cp), r(tr) }
Argument register     Ra      = { r(I) | natural(I) }
Permanent register    Rp      = { p(I) | natural(I) }
Addressing mode       X       = A ∪ Ra ∪ { r(h), r(b) } ∪ { T^X | T ∈ Tp ∧ X ∈ X } ∪
                                { [X] | X ∈ X } ∪ { X+N | X ∈ X ∧ natural(N) }
Instruction           I       (The set of BAM instructions is defined in section 3.2 and Appendix B)

The encoding of information in tags is designed to simplify common operations. It is similar to the
encoding used in the WAM (Figure 2.5). Atoms are represented as immediate values with a tatm tag.
Integers are represented as themselves, and are considered to have tint, tneg, or tpos tags for the
conditional branches that look at tags. Unbound variables are represented as pointers with a tvar tag
that point to themselves or another unbound variable. Structures and lists are represented as pointers
with tags tstr or tlst. They point to a contiguous block of their arguments on the heap. The main functor
and arity of a structure are stored there encoded in a single word. The main functor and arity of a list
(cons cell) are not stored since they are known implicitly.
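
The representation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Word` class, the dictionary used as a heap, and the helper names `new_unbound`, `new_cons`, and `new_struct` are hypothetical and do not come from the BAM definition. Only the tag names and the layout rules (self-pointing unbound variables, a functor/arity word for structures, no functor word for cons cells) are taken from the text.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Word:
    """A tagged word: a tag plus a value (hypothetical model, not the BAM encoding)."""
    tag: str                        # tvar, tlst, tstr, tatm, tint, tpos, tneg or tflt
    value: Union[int, float, str]   # heap index for pointer tags, symbol for atoms


heap = {}  # hypothetical heap: address -> Word


def new_unbound(addr):
    """An unbound variable is a tvar pointer that points to itself."""
    heap[addr] = Word("tvar", addr)
    return heap[addr]


def new_cons(addr, head, tail):
    """A cons cell stores only its two arguments; functor and arity are implicit."""
    heap[addr] = head
    heap[addr + 1] = tail
    return Word("tlst", addr)


def new_struct(addr, functor, args):
    """A structure stores functor/arity in one word, followed by its arguments."""
    heap[addr] = Word("tatm", f"{functor}/{len(args)}")
    for i, arg in enumerate(args, start=1):
        heap[addr + i] = arg
    return Word("tstr", addr)
```

A structure with two arguments therefore occupies three heap words, while a cons cell occupies two.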


The BAM defines five mappings to represent and access all data structures used during execution
(Table 3.4). These mappings are the Register Set, the Heap, the Trail, the Code Space, and the Label Map.
An infinite number of argument and permanent registers is assumed to exist. Of all registers, only the heap
pointer r(h) and the backtrack pointer r(b) are made explicit in the instruction set. The others are
implicit in its execution. Environments and choice points are represented as register sets that are stored in
registers r(e) and r(b), respectively. Prolog terms are stored in registers, on the heap, and on the
trail. Compound terms are stored on the heap as sequences of words in the same manner as is done in the
WAM (Figure 2.5). For all types except atoms, the value field of a word is a natural number that indexes
into the heap, and therefore points to terms on the heap. For atoms, the value field is the symbolic atom
itself. The correspondence between tags and Prolog data types is given in Table 3.5.

Table 3.4 - Run-time data structures of the BAM

Name          Definition
Register Set  (Rs ∪ Ra ∪ Rp) → W
Heap          W → W
Trail         N → W
Code Space    N → I
Label Map     L → N


Table 3.5 - Correspondence of tags with Prolog data types

Tag   Data type
tvar  An unbound variable or a general pointer.
tstr  Pointer to a structure: a compound term with a functor and a fixed number of arguments.
tlst  Pointer to a cons cell: a compound term consisting of two parts, a head and a tail.
tatm  An atom.
tpos  A nonnegative integer.
tneg  A negative integer.
tint  An integer.
tflt  A floating point number.

The  following  descriptions  clarify  the  correspondence  between  BAM  types  and  Prolog  types: 

(1) The value corresponding to a pointer tag is an index into an array of words. This is normally
implemented as an address.

(2) The value corresponding to a tatm tag is a symbol that uniquely identifies an atom or the main
functor of a structure. It is a Prolog atom or a Prolog structure of the form F/N where F is a Prolog
atom representing the functor and N is a nonnegative integer representing the arity. For correctness,
the assembler and run-time system must guarantee an exact correspondence between this symbol and
the contents of the run-time symbol table, so that the built-ins name/2, functor/3, arg/3,
and =../2 all work correctly.

(3) The value corresponding to a tpos or tneg tag is a nonnegative integer that represents the
absolute value of the integer represented by the word.

(4) The value corresponding to a tint tag is an integer that represents the value of the integer
represented by the word.

(5) The value corresponding to a tflt tag is a floating point number that represents the value of the
number represented by the word.

Nothing is assumed about how these types are represented on a real machine. When the BAM is targeted
to a real machine then the representation of types on the machine must be defined. The representation of
types changes with different target machines, different versions of the system, and even different programs.
The Implementation Manual [31] discusses how to port the BAM. Symbolic labels are pointers to code.
Since mappings can be of any size, they are pointers to data stacks in memory. The representation of a
word depends on the encoding used to represent tags on the machine, the word size of the machine, and on
the encoding of Prolog atoms into unique bit patterns. For the VLSI-BAM processor, all four types are
mapped into 32 bits and words consist of 4-bit tags and 28-bit values.
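
As an illustration of a 4-bit tag and 28-bit value sharing one 32-bit word, the following Python sketch packs and unpacks such a word. The text only gives the field widths; the numeric tag codes and the choice to put the tag in the high-order bits are assumptions made for this example and may differ from the actual VLSI-BAM encoding.

```python
TAG_BITS = 4
VALUE_BITS = 28
VALUE_MASK = (1 << VALUE_BITS) - 1

# Hypothetical numeric codes for the eight BAM tags (not the real VLSI-BAM codes).
TAG_CODES = {"tvar": 0, "tlst": 1, "tstr": 2, "tatm": 3,
             "tint": 4, "tpos": 5, "tneg": 6, "tflt": 7}
TAG_NAMES = {code: name for name, code in TAG_CODES.items()}


def pack(tag, value):
    """Combine a tag and a 28-bit value into one 32-bit word (tag in the high bits)."""
    if not 0 <= value <= VALUE_MASK:
        raise ValueError("value does not fit in 28 bits")
    return (TAG_CODES[tag] << VALUE_BITS) | value


def unpack(word):
    """Split a 32-bit word back into its tag name and value."""
    return TAG_NAMES[word >> VALUE_BITS], word & VALUE_MASK
```

On a general-purpose processor, `unpack` corresponds to the tag masking that the tag pragma of section 3.2 is meant to avoid when the tag is already known at compile-time.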


Table 3.6 - Notation for arguments of BAM instructions

Notation        Type
X, Y, Z         Addressing modes, elements of X. Most instructions use a subset of all possible
                addressing modes.
L, L1, L2, L3   Branch destinations, elements of L.
N               A natural number, element of N.
A               A Prolog atom, element of A.
Tag             A tag value, element of T.
Eq              An equality condition, element of Ce.
Cond            A condition, element of C.
Op              An arithmetic operation, element of E.
RegList         A list of registers used in choice point management.
                RegList ∈ { [a0, a1, ..., an] | n ∈ N, ai ∈ {i, no} }.

3.2.  An  overview  of  the  BAM 


The BAM uses types and data structures similar to the WAM. It has registers and stacks similar to
the WAM and uses a similar execution strategy. However, the instruction set is completely different. The
BAM has a load-store instruction set that is extended with tagged addressing modes and a few primitive
Prolog-specific instructions. A summary of the addressing modes and instructions is given in Tables 3.6
through 3.10. All instructions use only a subset of the addressing modes given in Table 3.3. The
instruction set includes:

•  Simple  instructions  (Table  3.7).  These  are  simple  register-transfer  level  operations  for  a  tagged 
architecture.  They  include  move,  push,  conditional  branch,  and  arithmetic.  These  instructions  are 
used  to  implement  many  cases  of  unification  and  many  built-in  predicates. 

•  Complex instructions (Table 3.8). There are five frequently-used operations defined as single
instructions: dereferencing (following a pointer chain to its end), trailing (saving a variable's address
so it can be restored on backtracking), general unification (when the compiler cannot simplify the
general case), choice point handling (saving and restoring state for backtracking), and environment
handling (creating and removing local stack frames).

•  Embedded information (Table 3.9). This allows a better translation to the assembly language of the
target machine. This information is expressed in two ways: (1) with pragmas, which resemble
instructions but are not executable, and (2) by extending instructions with additional arguments. An
example of (1) is the tag pragma, which gives the tag of a load or a store, e.g.:

pragma(tag(r(1),tvar)).     % Register r(1) contains a tvar tag.
move([r(1)],r(0)).          % Load register r(0) from register r(1).

By giving the tag at compile-time, this avoids tag masking on a general-purpose processor and
allows the load to be done in a single cycle. An example of (2) is:

unify(r(0),r(1),?,nonvar,fail).     % Register r(1) is nonvariable.

This gives no information about r(0) but says that r(1) is nonvariable. This allows the
unification to be done more efficiently because no check has to be done whether r(1) is unbound.

•  User instructions (Table 3.10). The BAM language is extended with several instructions, registers,
and tags that are never output by the compiler, but are intended for use only by a BAM assembly
programmer. This allows the non-Prolog component of the run-time system to be written completely
in BAM assembly. These instructions are described in Appendix D.
Table 3.7 - Simple instructions

Instruction                 Meaning
equal(X,Y,L)                Branch to L if X and Y are not equal.
move(X,Y)                   Move X to Y.
push(X,Y,N)                 Push X on stack with stack pointer Y and post-increment N.
Op(X,Y,Z)                   Perform the arithmetic operation Op on X and Y and store the
                            result in Z. Trap if an operand or the result is not integer.
adda(X,Y,Z)                 Full-word non-trapping add of a word X and an offset Y, giving
                            a word Z.
pad(N)                      Add N to the heap pointer.
switch(Tag,X,L1,L2,L3)      Three-way branch; branch to L1, L2, L3 depending on whether
                            the tag of X is tvar, Tag, or any other value.
test(Eq,Tag,X,L)            Branch to L if the tag of X is equal or not equal to Tag.
hash(T,X,N,L)               Look up X in a hash table of length N located at L. If X is in
                            the table then branch to the label in the table, else fall through.
                            T ∈ {atomic, structure}.
pair(E,L)                   A hash table entry. E is either an atom or a pair functor/arity.
jump(Cond,X,Y,L)            Jump to L if the arithmetic comparison of X and Y is true. Trap
                            if an operand is not integer.
jump(L)                     Jump unconditionally to L.
label(L)                    L is a branch destination.
procedure(Name/Arity)       Mark the beginning of a procedure.
call(Name/Arity)            Call the procedure Name/Arity.
jump(Name/Arity)            Jump to the procedure Name/Arity.
return                      Return from a procedure call.
simple_call(Name/Arity)     Non-nestable call used to interface with routines written in
                            BAM assembly.
simple_return               Non-nestable return used for routines written in BAM assembly.

3.3.  Justification of the complex instructions


The execution of Prolog requires five complex operations: dereferencing, trailing, unification,
backtracking, and environment management. These operations are represented as single instructions in the
BAM. In the WAM, dereferencing, trailing, and unification are done implicitly by many instructions even
when they are not needed. Making them explicit allows the compiler to minimize their use as much as
possible by doing them only when they are really needed.

The complex instructions could be expanded into sequences of simple instructions; however, this
expansion is not done at the BAM level but is delayed to the machine level. There are two reasons for this:

(1) Some machines may implement part or all of a complex instruction directly. Expanding it into
simple instructions is therefore premature since it would make this harder to detect. For example, the
VLSI-BAM processor has support for some complex instructions (e.g. dereferencing, trailing, and
unification).


Table 3.8 - Complex instructions

Instruction                   Meaning
deref(X,Y)                    Dereference X and store result in Y.
trail(X)                      Push X on the trail stack if the trail condition is satisfied.
unify(X,Y,Tx,Ty,L)            General unification of X and Y, branch to L if fail. Trailing is
                              done by this instruction. The extra parameters Tx, Ty ∈ {?,
                              var, nonvar} give information to improve the translation.
                              They are not needed for correctness.
unify_atomic(X,A,L)           Unify X with the atom A and branch to L if fail. No trailing is
                              done by this instruction.
allocate(N)                   Create an environment of size N on the local stack.
deallocate(N)                 Remove the top-most environment from the local stack.
choice(1/N,RegList,L)         Create a choice point containing the registers listed in
                              RegList and set the retry address to L.
choice(I/N,RegList,L)         Restore the argument registers listed in RegList from the
  (1<I<N)                     current choice point, and modify the retry address to L.
choice(N/N,RegList,fail)      Restore the argument registers listed in RegList from the
                              current choice point, and pop the current choice point from the
                              choice point stack.
fail                          Restore the machine state (except the argument registers) from
                              the most recent choice point, restore to unbound all variables
                              on the trail that were bound and trailed since the creation of
                              this choice point, and transfer control to the retry address.
move(r(b),X)                  Move the backtrack pointer to X. This must be done at the
                              entry of any predicate containing a cut.
cut(X)                        Make the choice point pointed to by X the new top of the
                              choice point stack.
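
The interaction between trailing and the fail instruction in Table 3.8 can be illustrated with a small Python sketch. It is a simplification under stated assumptions: words are modelled as (tag, value) tuples, the heap as a dictionary, and the helper names `bind` and `undo_to` are hypothetical. The sketch trails every binding unconditionally, whereas the real trail(X) instruction first checks a trail condition that is not modelled here.

```python
def bind(heap, trail, addr, word):
    """Bind the unbound variable at addr, recording its address on the trail.

    Assumption: trailing is unconditional; the BAM trail(X) instruction
    only pushes X when the trail condition is satisfied.
    """
    trail.append(addr)
    heap[addr] = word


def undo_to(heap, trail, mark):
    """Reset to unbound every variable trailed since the trail had length mark.

    This mirrors the part of fail that restores trailed variables; restoring
    the rest of the machine state is not modelled.
    """
    while len(trail) > mark:
        addr = trail.pop()
        heap[addr] = ("tvar", addr)   # an unbound variable points to itself
```

Here `mark` plays the role of the trail pointer value saved when the choice point was created.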

Table 3.9 - Embedded information (pragmas)

Instruction                     Meaning
pragma(align(X,N))              The contents of location X are a multiple of N.
pragma(tag(X,Tag))              The contents of location X have tag Tag.
pragma(push(term(N)))           A term of size N is about to be created on the heap.
pragma(push(cons))              A cons cell is about to be created on the heap.
pragma(push(structure(N)))      A structure of arity N is about to be created on the heap.
pragma(push(variable))          An unbound variable is about to be created on the heap.
pragma(hash_length(N))          A hash table of length N is about to be created.

(2) For best performance, optimizations should be done at all levels. The BAM level makes certain
optimizations easy, e.g. the determinism optimization in Chapter 6. Keeping the complex operations
as single instructions allows them to be optimized directly. For example, if a variable is
dereferenced twice then the second dereference can be removed. This is much harder to detect if the
dereference instruction is expanded into a loop.
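
To make the dereference example concrete, here is a minimal Python sketch of such a pass over a list of BAM-like instructions encoded as tuples. It is not the compiler's actual optimizer. The tuple encoding, the function name, and the `SAFE_OPS` set are assumptions, and the pass is deliberately conservative: any instruction not known to be harmless discards everything it has learned.

```python
# Instructions assumed (for this sketch only) not to write registers or bind variables.
SAFE_OPS = {"pragma", "equal", "test"}


def remove_redundant_derefs(code):
    """Drop a deref(X, Y) that repeats an earlier one with no intervening change."""
    known = set()   # (source, destination) pairs whose dereference is still valid
    out = []
    for instr in code:
        op = instr[0]
        if op == "deref":
            pair = (instr[1], instr[2])
            if pair in known:
                continue                      # second dereference: remove it
            # Writing the destination invalidates facts that mention it.
            known = {p for p in known if instr[2] not in p}
            known.add(pair)
        elif op not in SAFE_OPS:
            known.clear()                     # labels, moves, calls, ...: forget everything
        out.append(instr)
    return out
```

Because deref is a single instruction, the pass only has to compare operands. If the instruction had already been expanded into a loop, the same check would require recognizing the loop first.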


It is best to avoid assumptions about the characteristics of the target machine. In the cases where such
assumptions would be useful, the BAM uses pragmas to give the information without compromising the
machine independence. The translator is free to use or ignore this information.

Table 3.10 - User instructions

Instruction             Meaning
ord(X,Y)                Extract the value of X and move it to Y.
val(T,X,Y)              Create the word Y from the tag T and the value X.
jump_reg(R)             Jump to address stored in register R.
jump_nt(Cond,X,Y,L)     Jump to L if the full word comparison of X and Y is true.
                        Never trap.
Op_nt(X,Y,Z)            Perform the full word arithmetic operation Op (except multiply
                        and divide) on X and Y and store the result in Z. Never
                        trap.
trail_bda(X)            Push address X and the value stored there on the trail stack if
                        the trail condition is satisfied. This is a special trail instruction
                        for backtrackable destructive assignment.


3.4.  Justification  of  the  instructions  needed  for  unification 

This section constructs the BAM instructions and addressing modes required to support unification.
It turns out that both simple and complex instructions are necessary to support unification.

The two starting points are (1) an algorithm for unification (a specification of a unification algorithm
is given in Appendix C), and (2) a very general instruction set. The method proceeds in a top-down
manner by decomposing the unification algorithm into specialized instructions depending on information
about the form of the data known at compile-time (Figure 3.4).

This method is inspired by Kursawe [41] and Holmer [32]. Kursawe applies partial evaluation and
specialization in a top-down manner starting from a Prolog program and obtains an instruction set
resembling the WAM. Holmer describes several techniques for the automatic design of instruction sets, of
which decomposition is one. To go beyond the WAM it is necessary to make assumptions about the
architecture, a step that Kursawe does not take. The design of the BAM starts with a general instruction
set that does make these assumptions.


The choice of what general instruction set to start with is important. It is not useful to start with an
instruction set that has too little expressive power, for example one with a limited set of addressing modes,
because the required addressing modes are not yet known. Prematurely decomposing complex instructions
into simple ones side-steps the results.

Figure 3.4 - Decomposition of unification

The  following  assumptions  are  made: 

(1) The architecture is sequential and of Von Neumann design with multiple registers.

(2) The basic data element is a word, which is large enough to contain an address. A register holds one
word.

(3) The instructions have three parts:

•  An action. Some sample actions are data movement (move, push), conditional branching
(equal), and general unification (unify). Other important actions are multi-way branching
(switch) and several Prolog-specific operations (deref, trail).

•  A set of arguments. Unification acts on two operands, so typically two arguments are
sufficient.

•  A set of destination addresses. Depending on the outcome of the action, control continues at
one of the destinations. The size of the set and the meaning of its members depend on the
action. The address of the next instruction in the instruction stream is an implicit member of
the set.

(4) Arguments are referenced with multiple addressing modes. An infinite set of addressing modes is
defined in Table 3.3. The instructions derived in this section will need only a finite subset. For clarity,
Table 3.11 gives some abbreviations useful for this subset.


Table 3.11 - Useful abbreviations

Notation  Meaning
Disp      a positive heap displacement (bounded by the size of a term).
Offset    a nonnegative offset into a structure (bounded by the arity).
Imm       an immediate value; an atom or a numeric constant.
Var       a variable local to a clause, i.e. r(I) or p(J).
Arg       denotes Var or [Var+Offset].

Construction of the instruction set proceeds in the following steps. The data representation has already
been fixed (section 3.1). The existence of two forms of unification (read mode and write mode) and the
need for dereferencing and a three-way branch are shown. The instructions required for read mode and
write mode are constructed. Finally, the effects of variable representation (in registers or on the
environment) on the instruction set are discussed.


3.4.1.  The  existence  of  read  mode  and  write  mode 

The compilation of the unification T1 = T2, where T1 and T2 are two arbitrary terms, is reduced to
the compilation of V = T where at compile-time V is a variable and T is any term. At run-time there are
two values of V that result in different actions of the unification algorithm:

(1) V is an unbound variable, in which case T is constructed on the fly and bound to V (this is called
write mode). To satisfy the standard definition of unification, when T is bound to V a check needs to
be done (the occur check) that T does not contain V. Following Prolog implementation convention,
this check is ignored for efficiency reasons.

(2) V is a nonvariable term, in which case it is checked that the form of V matches T, and the algorithm
is invoked recursively for the term's arguments (this is called read mode).
3.4.2.  The need for dereferencing

Unifying two unbound variables makes one point to the other. Doing this several times leads to
pointer chains, with the common value of all the variables in a single location at the end of the chain. To
get a variable's value, the pointer chain is followed to its end, an operation known as dereferencing. It can
be provided as an addressing mode or as a separate instruction. Making it an instruction avoids repeated
dereferencing. Therefore the following instruction is added:

deref(Var1, Var2)

First Var1 is moved to Var2. Then the tag of Var2 is checked. If it is an unbound variable (tvar)
it reads memory and a loop is entered replacing Var2 by the referenced value while its tag is tvar and
its pointer part is different from Var2. A two-argument dereference is chosen over a single-argument
dereference because it allows a more compact representation of write-once variables (Chapter 5).

It is assumed in what follows that V and T are dereferenced when necessary, in particular that both
the trail and unify instructions are always given dereferenced arguments.
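
The loop performed by deref can be sketched as follows in Python. This is an illustration rather than the BAM definition: words are modelled as (tag, value) tuples and the heap as a dictionary from addresses to words, both of which are assumptions made for the example.

```python
def deref(heap, word):
    """Follow a chain of tvar pointers to its end.

    The loop stops at a nonvariable word or at an unbound variable,
    which is recognized because it points to itself.
    """
    tag, value = word
    while tag == "tvar":
        referenced = heap[value]          # one memory read per link in the chain
        if referenced == ("tvar", value):
            break                         # self-reference: unbound variable
        tag, value = referenced
    return (tag, value)
```

For example, with `heap = {0: ("tvar", 1), 1: ("tatm", "a")}`, dereferencing `("tvar", 0)` follows one link and yields the atom word stored at address 1.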

3.4.3.  The need for a three-way branch

The code for a unification V = T consists of three parts: (1) a check whether V is an unbound
variable or a nonvariable for choosing between write mode and read mode unification, (2) the instructions
for read mode unification, and (3) the instructions for write mode unification.

The tag field is available directly for the check of (1). The check has three possible results: the tag of
V matches a known tag (read mode), the tag is an unbound variable tag (write mode), or the tag is neither
(failure). This implies the following three-way branch:

switch(Tag, Var, VarLbl, TagLbl, FailLbl)

If the tag of Var is tvar (an unbound variable) then jump to VarLbl. If the tag of Var matches
Tag then jump to TagLbl. Otherwise jump to FailLbl. The failure address is explicit instead of
implicit to allow the implementation of fast incremental shallow backtracking.
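
A small Python sketch may help show how the three-way branch selects between the three parts of the code for V = T. It uses the simplest case, where T is an atom. The (tag, value) tuples, the dictionary heap, the label strings, and the function `unify_with_atom` are all assumptions made for illustration; trailing of the binding is omitted.

```python
def switch(tag, word, var_lbl, tag_lbl, fail_lbl):
    """Three-way branch on the tag of word, returning the chosen label."""
    if word[0] == "tvar":
        return var_lbl
    if word[0] == tag:
        return tag_lbl
    return fail_lbl


def unify_with_atom(heap, v, atom):
    """Unify the (already dereferenced) word v with an atom; return success."""
    target = switch("tatm", v, "write", "read", "fail")
    if target == "write":
        heap[v[1]] = ("tatm", atom)   # write mode: bind the variable (no trailing here)
        return True
    if target == "read":
        return v == ("tatm", atom)    # read mode: compare atomic values
    return False                      # neither tvar nor tatm: unification fails
```

In compiled BAM code the three outcomes are branch destinations rather than return values, which is what lets the failure path be an explicit label.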
3.4.4.  Constructing the read mode instructions

The general case of read mode unification is V = T, where at compile-time V is a variable or an
argument of a compound term, and T is a term. The first argument of each instruction is the value of V.
Two locations are possible for its value:

Var             V is a variable
[Var+Offset]    V is an argument of a compound term

The abbreviation Arg is used to denote one of these two addressing modes (Table 3.11). The second
argument and the action are determined by the compile-time knowledge of T. The possibilities are:

(1) T is partially or wholly known at compile-time. The possible information known about T is:

•  T is an unbound variable that has not yet been initialized, e.g. because it is the first occurrence
in the clause. V is moved directly to T.

•  T is an unbound variable. V is stored to T's location in memory.

•  T is atomic. Unification reduces to a check that T and V have the same atomic value. If the
values do not match the unification fails.

•  T is compound. Unification reduces to a check that V has the correct functor and arity,
followed by a unification of its arguments with T's arguments. If V's arguments are loaded into
registers then the unification can be compiled recursively. It follows that arbitrarily deep
nesting of addressing modes is not necessary if one instruction is added:

move([Var+Offset], Var)

(2) Nothing is known about T at compile-time. The unification of V and T requires a general
unification.

The  following  table  of  primitive  instructions  summarizes  the  action  and  both  arguments: 
Action  Argument V  Argument T  Explanation
move    Arg         Var         T is an unbound variable that has not yet been initialized.
move    Arg         [Var]       T is an unbound variable that has been initialized.
equal   Arg         Var         T is atomic or compound and its main functor is not known
                                at compile-time.
equal   Arg         Tag^Imm     T is atomic or compound and its main functor is known at
                                compile-time.
unify   Arg         Var         Nothing is known about T at compile-time.

The instructions equal and unify both can fail, so they have a failure address as third argument. The
equal instruction compares its arguments and jumps to FailLbl if they are not equal.

General unification (unify) is the most complex instruction. If the unification fails it jumps to
FailLbl. This instruction can be implemented using only the other instructions. However, it seems that
one additional instruction is useful: a multi-way branch with a different destination for each possible tag
value. If there are many possible tags this implies the existence of a jump table in memory, so that the
instruction must do a memory reference before it can branch. Instead of using this instruction, another
approach is to use a multilevel tree based on the three-way branch. Both approaches are viable since
general unification is used rarely in real programs. According to measurements done by Holmer for several
large programs [33], general unification takes about 4% of the total execution time of the VLSI-PLM [61].
More than 95% of these calls have arguments that are not compound terms of the same type and therefore
do not need the recursive algorithm.


3.4.5.  Constructing  the  write  mode  instructions 


The general case of write mode unification is V = T, where V is known to be an unbound variable at
run-time and T is a term. Assume that the term T is created on a stack (called the heap) with a minimal
number of move instructions. This assumption forces us to derive the form that a compound term has on
the heap. The following are the possible values of words of a compound term:

Var                   a variable (assumed initialized)
Tag^Imm               a simple subterm of T
Tag^(r(h)-Disp)       a pointer to a compound subterm of T

These are the source addressing modes for the move instructions. A variable Var does not have to be
dereferenced when it is stored on the heap because its value is not read. The destination of the move
instruction is a location on the heap. This location can be addressed either by a displacement addressing
mode offset from the heap pointer r(h), i.e. [r(h)-Disp], or by an auto-increment addressing mode,
i.e. a push instruction. The BAM uses the auto-increment addressing mode, for these reasons:

(1) Preliminary studies using exhaustive search [32] show that with the VLSI-BAM microarchitecture
the optimal way to create structures in write mode is by means of the idiom "load register, load
register, double-word push", i.e. two registers are loaded and then pushed in a single instruction.

(2) Instruction encoding is more compact, i.e. a push does not need a displacement field.

(3) In the VLSI-BAM architecture the push instruction is given a displacement field anyway. This
allows efficient implementation of uninitialized variables. For example, a cons cell whose cdr is
uninitialized can be created with a single push that has a displacement of 2.

(4) In the VLSI-BAM architecture the use of a push instruction allows a cache optimization: when
pushing a dirty line it is not necessary to flush the line first (H). This optimization was first done in the
PSI-II architecture [52].

To summarize, to create a term on the heap it is sufficient to choose from the following set of three
instructions (where r(h) is the stack pointer and 1 is the increment):

push(Var, r(h), 1)
push(Tag^Imm, r(h), 1)
push(Tag^(r(h)-Disp), r(h), 1)

It is also necessary to bind the term to V. This requires us to consider the form an unbound variable can
take. There are two possibilities:

(1) V has not yet been initialized, e.g. because it is the first occurrence in the clause. The term is moved
directly to V.

(2) V has been initialized; it points to a location in memory. The term is stored in this location.

These two possibilities result in the following two instructions:

move(A, Var)        store directly to a variable (variable is not initialized)
move(A, [Var])      store to variable's location (variable is initialized)

The addressing mode of the argument A depends on whether the term is compound or simple, and if it is
simple, whether it is an atom or a variable. This results in three possible values for A:

Var         a simple term (variable)
Tag^Imm     a simple term (nonvariable)
Tag^r(h)    a compound term (on the heap)


In addition to the above instructions, it is also necessary to initialize the first occurrence of a variable. One
way to do this is:

move(tvar^(r(h)-Disp), Var)
push(Var, r(h), 1)

With these instructions it is possible to create a term of size n on the heap in n pushes, a great
improvement over the WAM, which requires n+f-1 stores, f-1 dereference operations, and f-1 trail checks,
where f is the number of functors in the term. This idea was first proposed by André Mariën [44].
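
The claim that a term of size n needs n pushes can be checked on a small case with the following Python sketch, which builds the list [a,b] (two cons cells, four words). The `Heap` class, its push counter, and the bottom-up construction order are assumptions made for this example; the compiler may order the pushes differently. The inner cell is created first so that the outer cell can refer back to it with the Tag^(r(h)-Disp) addressing mode.

```python
class Heap:
    """A hypothetical heap that grows by auto-increment pushes and counts them."""

    def __init__(self):
        self.cells = []
        self.pushes = 0

    @property
    def h(self):
        """The heap pointer r(h): index of the next free word."""
        return len(self.cells)

    def push(self, word):
        """Model of push(X, r(h), 1): store one word and advance r(h) by 1."""
        self.cells.append(word)
        self.pushes += 1


def build_list_ab(heap):
    """Create the list [a,b] on the heap and return a tlst pointer to it."""
    heap.push(("tatm", "b"))          # inner cell: head b
    heap.push(("tatm", "[]"))         # inner cell: tail []
    outer = heap.h
    heap.push(("tatm", "a"))          # outer cell: head a
    heap.push(("tlst", heap.h - 3))   # outer cell: tail, i.e. tlst^(r(h)-3)
    return ("tlst", outer)
```

Starting from an empty heap, the returned word points at address 2 and exactly four pushes are performed, with no dereference or trail check along the way.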


3.4.6.  Representation  of  variables 

Assume that the execution model represents variables local to a clause in an environment, or stack
frame. There is a dedicated register r(e), called the environment pointer, that points to the current
environment in the environment stack. Variables local to a clause are stored either in registers or in an
environment, so the notation Var denotes one of the following two addressing modes:

r(I)    a variable in a register
p(J)    a variable on the environment stack

where p(J) is implemented as an offset into the environment, i.e. as [r(e)+J'] for some J'. This
implies that double indirection is possible: the addressing mode [Var+Offset] is [p(J)+Offset]
when Var is an environment variable. The double indirection is avoided by including one instruction:
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Tabic  3.12  -  Data  movement  instructions  for  unification 


_ Read  mode _ 

move(Arg,  Var) 
move(Arg,  fVarJ) 

equal (Arg,  Var,  F) 
equal (Arg,  Tag'Imm,  F) 

unify (Arg,  Var,  F) 


_ Write  mode _ 

pushfVar,  r(h),  1) 
push (Tag- Imm,  r(h),  1) 
pushfTag* (r(h) -Displ) ,  r(h),  1) 

move(Varl,  Var2) 
move (Tag* Imm,  Var) 
move (Tag* (r (h) -Disp)  ,  Var) 
move (Tag* t (h) ,  Var) 


.  move ( Var 1,  lVar2]) 
move (Tag* Imm,  [Var]) 

_ move (Tag* r (h) ,  (Var]) _ 

_ Tabic  3.13  -  Control  (low  and  other  instructions  for  unification _ 

switch  (Tag,  Var,  VarLbl,  TagLbl,  F)  three-way  branch 
jum(:)(Lbl)  join  read  and  write  mode  paths 

deref  (Varl,  Var2) _ _  dereference  a  pointer  chain 
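The behavior of the deref and switch instructions can be sketched in Python. The sketch assumes that a word is a (tag, value) pair, that a tvar word holds a heap address, and that an unbound variable is a tvar cell pointing to itself. These are modeling assumptions for the example only, and the tag name `tint` in the usage below is likewise hypothetical.

```python
# Sketch of deref and switch over a toy heap of (tag, value) words.
# Assumption: an unbound variable is a "tvar" word that points to itself.

def deref(heap, word):
    """Follow a chain of tvar pointers until a nonvariable or an unbound cell."""
    while word[0] == "tvar":
        target = heap[word[1]]
        if target == word:        # self-reference: unbound variable
            return word
        word = target
    return word

def switch(tag, word, var_lbl, tag_lbl, fail_lbl):
    """Three-way branch: variable, expected tag, or anything else."""
    if word[0] == "tvar":
        return var_lbl
    if word[0] == tag:
        return tag_lbl
    return fail_lbl

heap = [("tvar", 0), ("tvar", 0), ("tatm", "cat"), ("tvar", 2)]
bound = deref(heap, ("tvar", 3))      # chain ends at the atom cat
unbound = deref(heap, ("tvar", 1))    # chain ends at the unbound cell 0
```

In this model a switch would normally be applied to a dereferenced word, so that the variable branch is taken only for a genuinely unbound variable.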


3.4.7.  Summary  of  the  unification  instructions 


This section summarizes the BAM instructions necessary to support unification. Tables 3.12 and
3.13 present the instructions. They use only a small finite subset of the addressing modes of Table 3.3.

The following typical instructions illustrate the meaning of the notation:

move(tatm^axe, r(3))           Move the atom axe into register r(3).

move([r(3)+5], r(4))           Move the word located at address r(3)+5 into r(4).

equal(r(2), tatm^cat, F)       If r(2) is equal to the atom cat then fall through, else jump to label F.

unify(p(2), p(3), F)           Unify the term located in p(2) with the term located in p(3). Jump to
                               label F if the unification fails.

switch(tatm, r(3), V, T, F)    If r(3)'s tag is tvar then jump to label V. If r(3)'s tag is tatm then
                               jump to label T. Otherwise, jump to label F.

Chapter 4

Kernel transformations

1. Introduction

Four optimizing transformations are done on the kernel Prolog representation of programs: formula
manipulation, factoring, global dataflow analysis, and determinism extraction. The goal of the
transformations is to reduce a single metric: the total execution time of all unifications in the program. This metric is
approximated by the number of unifications and by the size of the terms being unified. The chapter first
describes the representation of types as logical formulas in the compiler. This is followed by a description
of each of the four transformations:

(1) Formula manipulation. The compiler implements a set of primitive transformations to replace
    Prolog code and types (both are represented as logical formulas) with simpler versions that have
    identical semantics. The simplicity of a formula is defined as the number of goals in the formula. These
    transformations are done whenever there is a possibility that the code is too complex, e.g. upon
    reading in a program and after other transformations such as the determinism transformation (see below).

(2) Factoring. This transformation groups sets of clauses in a predicate together if they have head
    unifications in common. This reduces the number of head unifications and shallow backtracking
    steps.

(3) Global dataflow analysis. This stage analyzes the program, annotates it with types, and restructures
    it. The analyzer uses abstract interpretation to determine the types of predicate arguments.

(4) Determinism transformation. This stage rewrites the program to make its determinism explicit, i.e.
    it replaces shallow backtracking by conditional branching. Many of the other transformations in this
    chapter are chosen to make this transformation possible more often. The transformation converts the
    predicate into a series of nested case statements. Sometimes this is only partially successful; certain
    branches of the case statements may still retain disjunctions (OR choices) that could not be converted
    into deterministic code.

To improve readability, the examples in this chapter are given in standard Prolog notation. It is understood
that they are represented internally in kernel Prolog.


2.  Types  as  logical  formulas 


Throughout the compiler, type information about variables is represented with logical formulas.
During compilation, any information learned is added to the formula, and deduction based on the formula
simplifies the generated code. It is a simple and powerful approach to avoid doing redundant operations at
run-time. For example, if a variable is dereferenced once, then it should never be dereferenced again.
Types in the compiler are defined as follows:

Definition T: Given a predicate f/n with main functor f and arity n, a type of f/n is a term
(f(A1, A2, ..., An) :- Formula) where the A1, A2, ..., An are n distinct variables and
Formula is a logical formula (i.e. a Prolog term).

For example, the type (range(A,B,C) :- integer(A), var(B), integer(C)) says that the first
and third arguments of range/3 are integers and the second argument is an unbound variable. The
compiler recognizes all Prolog type-checking predicates in the type formula. Appendix A gives a table of the
types recognized by the compiler. In addition to these types, several other types are recognized that do not
correspond to Prolog predicates. These types introduce distinctions between objects that depend on the
implementation and are indistinguishable in the language, for example, the difference between an integer
and a dereferenced integer, and the difference between an unbound variable that is not aliased to any other
and an unbound variable that may be aliased. The following types are recognized that do not exist as
Prolog predicates:

Internal Type     Description

uninit(X)         X is an uninitialized memory argument.

uninit_mem(X)     X is an uninitialized memory argument.

uninit_reg(X)     X is an uninitialized register argument.

unbound(X)        X is of one of the types uninit_mem(X), uninit_reg(X), or var(X).

deref(X)          X is dereferenced, i.e. it is accessible without following any pointers.

rderef(X)         X is recursively dereferenced, i.e. it is dereferenced, and if it is compound then
                  all its arguments are recursively dereferenced.

These types should not be given by the programmer since incorrect code or a significant loss of
efficiency may result if they are used incorrectly. For example, declaring an argument of a predicate to be
of uninitialized register type, i.e. the argument is an output that is passed in a register, may cause a large
increase in stack space used by the program if that predicate is the last goal in a clause, because last call
optimization is not possible in that instance. The safe approach is to leave the use of these types up to the
compiler.

The use of logical formulas to hold information during compilation can be contrasted with the use of
a symbol table in a compiler for an imperative language.† Representing types as logical formulas has two
advantages over a symbol table: (1) They are more flexible during compiler development. The kind of
information stored in a symbol table must be known when the compiler is designed. Formulas can contain
kinds of information that are not known during the compiler's design. (2) They lend themselves to
powerful symbolic manipulation such as deduction. Improving the deductive abilities leads to better code
without having to change any other part of the compiler. The disadvantage of this representation is that its
manipulation is slow. Future versions of the compiler could use a representation that is faster in the
common cases.

Type formulas are used in the following ways in the compiler:

(1) Representing type information known about a set of variables. For example, the formula
    (var(X), atom(Y)) means that X is an unbound variable and Y is an atom. The user manual
    (Appendix A) lists the types recognized by the compiler.

(2) Using a primitive form of deduction to simplify the generated code. For example, assume the
    formula is (list(X), var(Y), deref(Z), ...). To compile a run-time check that X is a
    nonvariable, the compiler first checks whether this formula implies nonvar(X). This is true because
    list(X) implies nonvar(X), so no run-time check is necessary.

(3) Updating the type formula when new information is learned. After compiling a goal, the formula is
    updated to represent the new knowledge that is gained. For example, after executing the arithmetic
    expression X is A+B it is known that X is an integer, so the formula is extended with
    integer(X).

† Of course, both the assembler and the run-time system use standard symbol tables.

In most cases, logical formulas are immutable, e.g. when a variable X is known to be a list (represented as
list(X)), that fact remains true forever. This is not true for all types. The types used to denote unbound
variables (e.g. var(X) and uninit(X)) become false as a result of an instantiation. This is also true
of the standard order comparisons (e.g. X@<Y, X@>Y, and so forth) and the types deref(X) and
rderef(X). The compiler is careful to take this into account when updating the type formula.
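The three uses of type formulas can be sketched in Python. The sketch represents a formula as a set of (type name, variable) pairs read as a conjunction. The `IMPLIES` table is a small hand-picked subset used only for illustration, and the function names and the `UNBOUND_TYPES` set are assumptions of this example; the compiler's actual rule set is larger and is not reproduced here.

```python
# A type formula is modeled as a set of (type_name, variable) facts,
# read as a conjunction. The tables below are a hypothetical subset.

IMPLIES = {
    "list": {"nonvar"},
    "atom": {"nonvar"},
    "integer": {"nonvar"},
    "uninit": {"deref"},
}

# Types that stop holding once the variable is instantiated.
UNBOUND_TYPES = {"var", "uninit"}

def implies(formula, fact):
    """True if some known fact about the same variable entails `fact`."""
    name, variable = fact
    for known_name, known_variable in formula:
        if known_variable != variable:
            continue
        if known_name == name or name in IMPLIES.get(known_name, ()):
            return True
    return False

def update_formula(new_fact, formula):
    """Drop facts contradicted by new_fact, then add new_fact."""
    _, variable = new_fact
    kept = set(formula)
    if implies({new_fact}, ("nonvar", variable)):
        kept = {(n, v) for n, v in kept
                if not (v == variable and n in UNBOUND_TYPES)}
    return kept | {new_fact}

formula = {("list", "X"), ("var", "Y"), ("deref", "Z")}
skip_check = implies(formula, ("nonvar", "X"))         # no run-time test needed
after_is = update_formula(("integer", "Y"), formula)   # models learning integer(Y)
```

Here `skip_check` is true because list(X) entails nonvar(X), and `after_is` no longer contains var(Y) because binding Y to an integer contradicts it. The sketch ignores the other types that can become false, such as deref(X) and the standard order comparisons.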


Table 4.1 - Primitives to manipulate logical formulas and Prolog formulas

Primitive                      Description

F1 implies F2                  Implication: Succeeds if it can determine that there does not exist an
                               assignment to variables in F1 and F2 that causes both F1 and not(F2)
                               to succeed.

F2 := simplify(F1)             F2 is a simplification of F1.

F2 := subsume(F, F1)           F2 is a simplification of F1, given that F is true.

F2 := update_formula(F, F1)    F2 is the result of removing information contradicted by F from F1
                               and adding F to F1.

3.  Formula  manipulation 


The compiler implements a set of primitive transformations to manipulate formulas. They are
summarized in Table 4.1, where F, F1, and F2 are logical formulas. Each of these primitives has two versions:
a pure logical and a Prolog version. The logical version is used to manipulate types (see previous section).
It assumes the formula has a purely logical meaning, i.e. that the operational concepts of execution order of
goals, number of solutions, and backtracking behavior are not important. The Prolog version is used to
manipulate kernel Prolog code. It assumes the formula must keep Prolog's operational semantics.

Implication is implemented to work well with most combinations of Prolog predicates that are used
in type declarations. The following examples all return with success:

atom(X) implies nonvar(X)

X<Y implies integer(X)

X<5 implies X<10

uninit(X) implies deref(X)

functor(X,_,0) implies atomic(X)

(X==a ; X==b) implies atom(X)

Simplification is done on standard Prolog, on kernel Prolog, and on type formulas. Table 4.2 gives some
examples to illustrate the difference between logical and Prolog semantics. A single function simplify(F)
handles both logical and Prolog semantics (Figure 4.1). For conciseness, the definition of simplify(F) uses
the compound terms (A,B), (A;B), (A->B), and (\+(A)) both as selectors (to choose the branch
of the case statement) and constructors (in the calls to simp_step(F)). Tables 4.3 and 4.4 define part of the
definition of simp_step(F), the primitive simplification step. The complete definition contains about 50
rules. The functions subsume(F, F1) and update_formula(F, F1) are implemented in a similar way.

Table 4.2 - Examples of simplification

Formula          Simplified formula             Comments
                 Logical      Prolog

(true ; true)    true         (true ; true)     The Prolog version is unchanged unless the compiler
                                                option same_number_solutions is disabled.

(p, fail)        fail         (p, fail)         The Prolog version is unchanged unless the compiler can
                                                deduce that p has no side effects (read / write or
                                                assert / retract).

(!, p ; q)       (p ; q)      (!, p)            Cut is logically identical to true, but it must be
                                                retained since it modifies backtrack behavior in the
                                                entire clause containing it.


function simplify(F : formula) : formula;
begin
    case /* decompose the formula */
        F = (A,B)  : return simp_step( (simplify(A), simplify(B)) );    /* and */
        F = (A;B)  : return simp_step( (simplify(A); simplify(B)) );    /* or */
        F = (A->B) : return simp_step( (simplify(A) -> simplify(B)) );  /* implies */
        F = \+(A)  : return simp_step( \+(simplify(A)) );               /* negation */
        otherwise  : return simp_step(F);
    end
end;

Figure 4.1 - Simplification of a formula

Table 4.3 - Simplification rules (part of simp_step's definition)

Rule                               Condition to apply this rule
Input formula    Output formula

(true,A)         A                 (none)
(A,true)         A                 (none)
(true;A)         true              semantics(prolog) ∧ no_side_effects(A) ∧ diff_sol ∧ no_bind(A)
(true;A)         true              semantics(logical)
(A,fail)         fail              semantics(prolog) ∧ no_side_effects(A)
(A,fail)         fail              semantics(logical)
(fail,A)         fail              (none)
(fail;A)         A                 (none)
(A->true;B)      A                 semantics(prolog) ∧ succeeds(A) ∧ deterministic(A)
(A->true;B)      A                 semantics(logical) ∧ succeeds(A)
A                fail              semantics(prolog) ∧ fails(A) ∧ no_side_effects(A)
A                fail              semantics(logical) ∧ fails(A)

Table 4.4 - The conditions for applying simplification rules

Condition             Description

semantics(S)          Simplify according to semantics S where S ∈ {prolog, logical}.
no_side_effects(A)    Formula A does not have side effects when executed.
deterministic(A)      Formula A gives only one solution when executed.
no_bind(A)            Formula A does not bind any variables.
diff_sol              Relax semantics of Prolog to allow a different number of solutions.
succeeds(A)           Formula A always succeeds when executed.
fails(A)              Formula A always fails when executed.
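The recursive structure of Figure 4.1 and a few of the rules of Table 4.3 can be sketched in Python. The sketch covers only logical semantics and only the rules that need no side conditions such as succeeds(A). The tuple encoding of formulas, with the connective as the first element, is an assumption of this example.

```python
# Formulas are "true", "fail", a goal name, or a tuple whose first
# element is the connective: (",", A, B), (";", A, B), ("->", A, B),
# ("\\+", A). Only logical semantics is modeled.

def simp_step(f):
    """Apply one primitive rule (a small subset of Table 4.3)."""
    if isinstance(f, tuple) and f[0] == ",":
        _, a, b = f
        if a == "true":
            return b                      # (true,A) -> A
        if b == "true":
            return a                      # (A,true) -> A
        if a == "fail" or b == "fail":
            return "fail"                 # (fail,A) and (A,fail) -> fail
    if isinstance(f, tuple) and f[0] == ";":
        _, a, b = f
        if a == "fail":
            return b                      # (fail;A) -> A
        if a == "true":
            return "true"                 # (true;A) -> true, logical only
    return f

def simplify(f):
    """Simplify subformulas first, then the formula itself (Figure 4.1)."""
    if isinstance(f, tuple):
        if f[0] in (",", ";", "->"):
            return simp_step((f[0], simplify(f[1]), simplify(f[2])))
        if f[0] == "\\+":
            return simp_step((f[0], simplify(f[1])))
    return simp_step(f)

example = simplify((";", "fail", (",", "q", "true")))
```

A Prolog-semantics version would have to check the conditions of Table 4.4 before applying rules such as (A,fail) -> fail, which this sketch does not attempt.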

4.  Factoring 

Factoring is based on the operation of finding the most-specific-generalization, or MSG, of two
terms. Factoring collects groups of clauses whose heads can be combined in nontrivial fashion using the
MSG operation. The advantage of factoring is that it reduces the number of unifications performed during
execution. Figure 4.2 defines the MSG in terms of unification. Given two terms T1 and T2, consider the
set M of all terms that unify with both of them. The MSG of T1 and T2 is the unique element Tm of M
which unified with any other element U of M gives Tm. Intuitively, this says that Tm contains the maximal
common information of T1 and T2.

function msg(T1, T2 : term) : term;
var
    M : set of term;
    Tm, U : term;
begin
    M := { T | T unifies with T1 and T unifies with T2 };
    Find Tm ∈ M such that ∀U ∈ M : unify(U, Tm) = Tm;
    return Tm
end;

Figure 4.2 - The most specific generalization

The MSG (also called anti-unification) is the dual operation to unification. Given two terms,
unification finds a term that is a more instantiated case of each of the two, i.e. the most general common
instance of the two. The MSG is a term of which each of the two is a more instantiated case. For example,
consider the two compound terms s(A,x,C) and s(A,B,y). Unifying these two terms results in
s(A,x,y). The MSG of the two terms is s(A,B,C). Unification may fail, i.e. the most general unifier
is the empty set. Finding the MSG never fails. In the worst case, the generalization of the two terms is an
unbound variable, which represents the set of all terms. For example, consider the two atomic terms x
and y. Unifying these two results in failure, whereas the MSG is an unbound variable.

Another  way  of  viewing  the  MSG  operation  is  as  an  approximation  to  the  union  of  two  sets.  Every 
term  corresponds  to  a  set  by  instantiating  the  variables  in  the  term  to  all  possible  ground  values.  In  general, 
the  union  of  two  of  these  sets  does  not  correspond  to  any  term.  The  MSG  finds  the  smallest  superset  of  the 
union  that  is  represented  by  a  term.  A  similar  property  holds  of  unification;  it  finds  the  largest  subset  of  the 
intersection  that  is  represented  by  a  term. 
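As an illustration, the MSG of two terms can be computed directly with the standard anti-unification recursion rather than through the set-based definition of Figure 4.2. The Python sketch below is one such formulation, not the compiler's code. It assumes that compound terms are tuples (functor, arg1, ...), that atoms and variables are strings, and that the input terms do not already use names of the form `_G0`, `_G1`, and so on, which the sketch reserves for fresh variables.

```python
# Anti-unification over terms encoded as nested tuples and strings.
# Disagreeing subterm pairs are replaced by fresh variables; the same
# pair always maps to the same fresh variable.

def msg(t1, t2, pairs=None):
    """Return the most specific generalization of t1 and t2."""
    if pairs is None:
        pairs = {}
    if t1 == t2:
        return t1
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] and len(t1) == len(t2)):
        args = tuple(msg(a, b, pairs) for a, b in zip(t1[1:], t2[1:]))
        return (t1[0],) + args
    if (t1, t2) not in pairs:
        pairs[(t1, t2)] = "_G%d" % len(pairs)   # fresh variable name
    return pairs[(t1, t2)]

# s(A,x,C) and s(A,B,y) generalize to s(A,_G0,_G1), i.e. s(A,B,C)
# up to renaming of variables.
general = msg(("s", "A", "x", "C"), ("s", "A", "B", "y"))
```

For the atoms x and y the function returns a single fresh variable, matching the worst case described above. Reusing one fresh variable per disagreeing pair is what keeps the result most specific: f(x,x) and f(y,y) generalize to f(_G0,_G0), not to f(_G0,_G1).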

For all arguments of the predicate, the factoring transformation finds the largest contiguous set of
clauses whose MSG is a compound term. This set is used to define a dummy predicate and the definition of
the original predicate is modified to call the dummy predicate. The algorithm is given in Figure 4.3. As an
example of factoring, consider the predicate:

h([x|_]).
h([y|_]).
h([]).

The lists in the heads of the first two clauses are combined: the MSG of [x|_] and [y|_] is [_|_].
The result after factoring is:

h([A|B]) :- h'(B, A).
h([]).

h'(B, x).
h'(B, y).

procedure factoring;
var
    M : term;
    Ci, C'i : clause;
    π, Pπ : list of clause;
    a, i, p, q : integer;
begin
    for each predicate P in the program do begin
        /* At this point P = [C1, C2, ..., Cn] (list of n clauses) */
        /* and Ci = (Hi :- Bi) (Each clause has head Hi and body Bi) */
        for a := 1 to arity(P) do begin
            Partition P such that each contiguous group π = [Cp, ..., Cq] (1 ≤ p ≤ q ≤ n)
            satisfies exactly one of the two properties:
                1. Either p = q (π contains only one clause), or
                2. π is the largest group for which M = MSG(argument a of Hi, p ≤ i ≤ q) is compound.
            for each contiguous group π do if p < q then begin
                /* Create the dummy predicate Pπ */
                for i := p to q do begin
                    C'i := Ci;
                    Remove M from H'i;
                    Add all variables in M as arguments to H'i
                end;
                Pπ := [C'p, ..., C'q];
                /* Create the call to the dummy predicate */
                H := (new head with same functor and arity as P and M in argument a);
                Hπ := (new head with same functor and arity as Pπ);
                for i := 1 to arity(P) do if i ≠ a then begin
                    Make argument i of H and Hπ identical
                end;
                Replace π in P by the single clause (H :- Hπ)
            end
        end
    end
end;

Figure 4.3 - The factoring transformation

Factoring reduces the number of unifications done at run-time in two ways: (1) compound terms are only
created once during predicate execution, instead of being repeated for each clause (e.g. the list [A|B] in
the example), and (2) the arguments of compound terms become predicate arguments, which more often
allows the determinism transformation to convert shallow backtracking into deterministic selection (e.g. the
value of the second argument of the predicate h' determines the correct clause directly without any
superfluous unifications). The following heuristic is used:

Factoring Heuristic: For each argument in a predicate, factor the largest set of contiguous
clauses whose MSG is a compound term. Repeat this operation until no more factoring is
possible.

This heuristic needs refinement in some cases to avoid superfluous choice point creation which may slow
down execution. The savings of multiple structure creation (how many fewer unifications are done) should
be weighed against how much deterministic selection is possible in the dummy predicates.

If the compiler option same_order_solutions is enabled (the default) then the operational
semantics is that of standard Prolog, i.e. the order of solutions returned on backtracking is identical to that
of standard Prolog. Disabling the option relaxes the semantics of standard Prolog by also factoring
noncontiguous clauses whose MSG is a compound term. This may change the ordering of solutions on
backtracking. This option allows experimentation with variations of standard Prolog semantics.

To illustrate how factoring can reduce the amount of shallow backtracking, consider the following
predicate, which is part of a definition of quicksort:

partition([Y|L],X,[Y|L1],L2) :- Y=<X, partition(L,X,L1,L2).
partition([Y|L],X,L1,[Y|L2]) :- Y>X, partition(L,X,L1,L2).
partition([],_,[],[]).

The first argument of the first two clauses can be factored, resulting in:

partition([Y|L],X,L1,L2) :- partition'(L,X,L1,L2,Y).
partition([],_,[],[]).

partition'(L,X,[Y|L1],L2,Y) :- Y=<X, partition(L,X,L1,L2).
partition'(L,X,L1,[Y|L2],Y) :- Y>X, partition(L,X,L1,L2).

(In the compound term [Y|L] the rightmost variable L is kept in the same argument position and the
other variable Y is put at the end of the goal.) The transformation results in only a single unification of
[Y|L] instead of two in the original definition. In the dummy predicate the comparisons Y=<X and
Y>X use arguments of the predicate, not arguments of a compound term. This makes it possible to compile
partition/4 with a conditional branch instead of with shallow backtracking.
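The grouping step of the factoring heuristic can be sketched in Python for a single argument position. The sketch takes the values of one head argument, clause by clause, and greedily extends each contiguous group for as long as the running MSG stays compound. The greedy left-to-right strategy, the tuple encoding of terms, and the names `msg` and `group_contiguous` are assumptions of this example; Figure 4.3 does not prescribe how the partition is found.

```python
# Terms are nested tuples (functor, arg1, ...) or strings. A list cell
# [H|T] is encoded as (".", H, T) and the empty list as "[]".

def msg(t1, t2, pairs=None):
    """Most specific generalization by anti-unification."""
    if pairs is None:
        pairs = {}
    if t1 == t2:
        return t1
    if (isinstance(t1, tuple) and isinstance(t2, tuple)
            and t1[0] == t2[0] and len(t1) == len(t2)):
        return (t1[0],) + tuple(msg(a, b, pairs)
                                for a, b in zip(t1[1:], t2[1:]))
    if (t1, t2) not in pairs:
        pairs[(t1, t2)] = "_G%d" % len(pairs)
    return pairs[(t1, t2)]

def group_contiguous(column):
    """Split one argument column into (first, last, msg) groups."""
    groups = []
    i = 0
    while i < len(column):
        general = column[i]
        j = i + 1
        while j < len(column):
            candidate = msg(general, column[j])
            if not isinstance(candidate, tuple):
                break                     # MSG is no longer compound
            general = candidate
            j += 1
        groups.append((i, j - 1, general))
        i = j
    return groups

# First arguments of the three partition/4 clause heads.
column = [(".", "Y", "L"), (".", "Y", "L"), "[]"]
groups = group_contiguous(column)
```

For partition/4 the result is one group covering the first two clauses, with MSG [Y|L], and a singleton group for the third clause. Only groups with more than one clause would be turned into a dummy predicate.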


5. Global dataflow analysis

It is difficult to obtain information about a program by executing it in its original form, since the
range of possible behaviors is potentially infinite, and even simple properties of programs may be
undecidable. To get around this problem, the idea of abstract interpretation is to transform the program into a
simpler form which allows practical analysis. After the analysis the inverse transformation gives
information about the original program. The fundamentals of a general method based on this approach and its
mathematical underpinning are explained by Kildall [37] and Cousot & Cousot [23]. Marriott and
Søndergaard [47] give a lucid explanation of the basic ideas. This method has been studied extensively and
developed into a practical tool for Prolog [18,21,24,25,49,50,53,66,67,76,84].

The four sections that follow summarize the relevant parts of the theory of abstract interpretation,
present my application of it to Prolog, describe the analysis algorithm in detail, and discuss the integration
of the algorithm into the body of the compiler. In Chapter 7 an evaluation is done of the effectiveness of
the algorithm.

5.1.  The  theory  of  abstract  interpretation 

The transformed program should mimic the original faithfully. This is made rigorous by introducing
the concept of descriptions of data objects. Let $E$ be the powerset, i.e. the set of all subsets, of a set of data
objects, and $D$ be a partially ordered set of descriptions. Then an abstract interpretation is defined by the
following conditions:

(1) $E_P : E \to E$, $D_P : D \to D$

(2) $\alpha : E \to D$, $\gamma : D \to E$

(3) $\alpha$ and $\gamma$ are monotonic.

(4) $\forall d \in D : d = \alpha(\gamma(d))$

(5) $\forall e \in E : e \le \gamma(\alpha(e))$

(6) $\forall d \in D : E_P(\gamma(d)) \le \gamma(D_P(d))$

The operator $E_P$ in the first condition describes a single step of the execution of the program $P$ as a state
transformation. Symbolic execution of the transformed program is described by the operator $D_P$. Except
for the conditions given above, the choice of $E_P$ and $D_P$ is completely free. The choice is guided by
several trade-offs, for example: (1) speed versus precision of the analysis, (2) complexity versus confidence
in the correctness of the analysis.

As an example of $E_P$ (from Cousot & Cousot [23]), consider a program in an imperative language
represented as a graph where each node is a simple statement such as an assignment or a conditional. Let
an environment be defined as a correspondence between each variable in the program and a possible value.
Then for each edge of the graph a set of possible environments (called a context) is given. Initially they are
all unknown. An application of $E_P$ transforms all contexts to their new values reached after one execution
step.

For Prolog, a natural choice is to identify $E_P$ with the standard operator $T_P : 2^{B_P} \to 2^{B_P}$ which
describes its procedural semantics. In this case $E$ is $2^{B_P}$, where $B_P$ is the Herbrand universe of the program
$P$, i.e. the set of all ground goals† that can be constructed using predicates, functors, and constants of the
program. $T_P$ does a single "forward chaining" step to find the conclusions that can be inferred from a
given set of ground goals. Formally, $E_P$ maps any $I \subseteq B_P$ into

$$T_P(I) = \{ A \in B_P : A \leftarrow A_1, \ldots, A_n \text{ is a ground instance of a clause in } P \text{ and } \{A_1, \ldots, A_n\} \subseteq I \}$$

In other words, an application of $T_P$ transforms a subset of $B_P$ into a new subset containing the new goals
inferred from the program's clauses given the old goals. The meaning of a program $P$ is defined as
$\mathrm{lfp}(T_P)$ (where lfp = the least fixpoint operator). This is the set of all ground goals that can be derived
from the program clauses. For example, consider the following program:

nat(0).
nat(s(X)) :- nat(X).

which states that nat(X) is true if X is zero or X is the successor of a natural number. The program's
meaning is:

{ nat(0), nat(s(0)), nat(s(s(0))), nat(s(s(s(0)))), ... }

which represents the set of natural numbers.

† These are called "atoms" in mathematical logic. To avoid confusion with the atom data type in Prolog, this
dissertation uses the Prolog terminology.
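The forward chaining operator for the nat program can be sketched in Python. Because the least fixpoint is infinite, the sketch bounds the nesting depth of s so that the iteration terminates. The bound, the encoding of nat(s^k(0)) as the integer k, and the function names are assumptions made for this illustration.

```python
# The goal nat(s(...s(0)...)) with k applications of s is encoded as
# the integer k. The depth bound max_depth is an artificial cutoff.

def tp(interp, max_depth):
    """One forward chaining step for: nat(0).  nat(s(X)) :- nat(X)."""
    out = {0}                                        # the fact nat(0)
    out.update(k + 1 for k in interp if k + 1 <= max_depth)
    return out

def lfp(max_depth):
    """Iterate tp from the empty set until nothing new is derived."""
    current = set()
    while True:
        following = tp(current, max_depth)
        if following == current:
            return current
        current = following

def show(k):
    """Render the encoded goal in Prolog syntax."""
    return "nat(" + "s(" * k + "0" + ")" * k + ")"

meaning = sorted(show(k) for k in lfp(3))
```

With a bound of 3 the iteration visits {}, {0}, {0,1}, {0,1,2}, {0,1,2,3} and then stops, which gives the first four elements of the set shown above.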

The second and third conditions introduce the operators $\alpha$ (the abstraction function) and $\gamma$ (the
concretization function). The operator $\alpha : E \to D$ determines the description corresponding to a particular set
of data objects. The operator $\gamma : D \to E$ determines the set of data objects corresponding to a particular
description.

The fourth and fifth conditions ensure that $\alpha$ and $\gamma$ behave correctly with respect to each other.
Condition four means that in going from descriptions to data objects and back no information is lost. Condition
five means that in going from a data object to a description and back the resulting set of data objects
includes the original data object. The sixth condition is known as the safeness criterion. It is necessary to
ensure that the symbolic execution (through $D_P$) mimics the execution of $P$ accurately (through $E_P$). In
other words, the abstract interpretation gives descriptions that include all the data objects that the execution
of the original program gives.

To illustrate what the conditions mean consider the abstract domain of signs of real numbers. The
data objects are real numbers. Let there be three possible signs for numbers: + (positive), - (negative), and
0 (zero). The set of descriptions $D$ describes the possible states of a set of reals, so it contains all
combinations of the three signs:

$$D = \{ \{\}, \{0\}, \{+\}, \{-\}, \{+,-\}, \{+,0\}, \{-,0\}, \{+,-,0\} \}$$

According to the second condition $\alpha$ maps a set of reals onto its signs, and $\gamma$ maps a set of signs onto a set
of reals. For example:

$$\alpha(\{-5\}) = \{-\}$$

$$\alpha(\{-3, 5\}) = \{+,-\}$$

$$\gamma(\{+\}) = \{ r \in \mathbb{R} : r > 0 \}$$

The fourth condition says that going from a sign to a set of reals and back will give the same sign. The
fifth condition says that going from a set of reals to a sign and back will give a set of reals that includes the
original set. So for example:

$$\{+\} = \alpha(\gamma(\{+\}))$$

since $\gamma(\{+\})$ is the set of positive reals, whose sign is $\{+\}$, and:

$$\{5\} \subseteq \gamma(\alpha(\{5\}))$$

since $\alpha(\{5\}) = \{+\}$, and $\gamma(\{+\})$ is the set of positive reals, which contains 5. In order to explain
condition six, consider the equation $27 \times 37$. Here $E_P$ is multiplication of reals, and $D_P$ is the corresponding
operation in the abstract domain of signs. The multiplication corresponds to $\{+\} \times \{+\}$ in the abstract
domain. The result of the abstract multiplication should be $\{+\}$, since $27 \times 37 = 999$, which is positive.

Condition six is a formalization of this requirement.
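The sign domain can be made concrete with a short Python sketch. It implements the abstraction function on finite sets of numbers and an abstract multiplication on sets of signs, then checks the safeness requirement on finite samples. Since the concretization of a sign set is infinite, the check compares abstractions of sampled products instead of concretizations; this restriction and the function names are assumptions of the example.

```python
# Signs are the strings "+", "-" and "0"; a description is a frozenset
# of signs.

def alpha(reals):
    """Abstraction: the set of signs occurring in a finite set of reals."""
    signs = set()
    for r in reals:
        signs.add("+" if r > 0 else "-" if r < 0 else "0")
    return frozenset(signs)

def abstract_mul(d1, d2):
    """Abstract multiplication on sets of signs."""
    out = set()
    for s in d1:
        for t in d2:
            if s == "0" or t == "0":
                out.add("0")
            elif s == t:
                out.add("+")
            else:
                out.add("-")
    return frozenset(out)

def is_safe_on(samples_a, samples_b):
    """Sampled form of condition six: concrete results are covered."""
    concrete = {x * y for x in samples_a for y in samples_b}
    return alpha(concrete) <= abstract_mul(alpha(samples_a), alpha(samples_b))

product_sign = abstract_mul(alpha({27}), alpha({37}))
```

Here `product_sign` is the description {+}, in agreement with 27 × 37 = 999 being positive, and `is_safe_on` holds for any finite samples because the sign of a product is determined by the signs of its factors.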

Dataflow analysis is done by transforming the original program over the domain $E$ described by $E_P$
to a new version over the domain $D$ described by $D_P$. Then $\gamma(\mathrm{lfp}(D_P))$ (lfp = least fixpoint operator)
gives a conservative estimate of the required information. Much work has been done in discovering useful
domains $D$ for particular applications and efficient algorithms for finding fixpoints of $D_P$ [10,53].

5J.  A  practical  application  of  abstract  interpretation  to  Prolog 

The implementation of abstract interpretation presented in this dissertation uses a very different $E_P$ from the one suggested in the previous section by the formal definition of Prolog's procedural semantics. The choice of $E_P$ used in the Aquarius compiler closely follows execution on a machine. Consider a program with $n$ predicates $P_i$. The data objects are the $n$-tuples $(T_1, T_2, \ldots, T_n)$ where each $T_i$ is a functor of same name and arity as $P_i$ and the arguments of $T_i$ are terms constructed using data functors and atomic terms in the program and possibly containing unbound variables. $E$ is the powerset of these data objects. The descriptions are the $n$-tuples $(L_1, L_2, \ldots, L_n)$ where each $L_i$ is a functor of same name and arity as $P_i$ and the arguments of $L_i$ are constrained to be on a given finite lattice. $D$ is the set of these descriptions. A lattice is a partially ordered set in which every nonempty subset has a least upper bound (denoted as the lub) and a greatest lower bound (denoted as the glb). Each of the elements of the lattice corresponds to a set of possible values in the original program. This lattice is called an argument lattice, since it is used to represent the possible values of a predicate argument. A predicate lattice (such as $L_i$) is the Cartesian product of the lattices of all the predicate's arguments.

The operator $E_P$ that mirrors execution of the program corresponds to a single resolution step. It is a transformation of a set of data objects and an execution state to another set of data objects and a new execution state, following Prolog's depth-first execution semantics, that is, its left-to-right execution of goals in a clause, and its top-to-bottom selection of clauses in a predicate. The operator $D_P$ that mirrors execution of the program over the descriptions is similar, except that the arguments are lattice values.

If the conditions of abstract interpretation hold, then the least fixpoint of the symbolic execution over the lattice is a conservative approximation to the global information; in other words, the set of values that a variable can have during execution is a subset of what is derived in the analysis.

The three sections that follow describe the lattice used by the analysis algorithm. The first section introduces and defines the lattice elements and the types with which they correspond. The next section gives an example to show how to derive the types. The last section summarizes the properties of the types that are used by the algorithm.

5.2.1. The program lattice

Dataflow analysis for Prolog differs from that of statically typed languages because it does not check types, but it infers them. The most important information that can be deduced about an argument is whether it is used as an input or an output argument of a predicate, i.e. the mode of the argument. After the mode is determined, it is useful to find its type, i.e. the set of values that it can have. The remainder of this chapter refers only to the type of an argument, on the assumption that this implies the mode as well. I have experimented with four lattices of varying complexity in the analyzer, and the lattice that is currently implemented has been chosen to give the most information while keeping analysis fast.

During the analysis the algorithm maintains two lattices for each predicate in the program. These lattices correspond to the entry and exit types of the predicate, i.e. the value of the variable that is valid upon entering the predicate and upon successful exit from the predicate. The lattice describing the entire program is the Cartesian product of the predicate lattices.

[Figure 4.4: diagram of the argument lattice. From top to bottom: any (any value is possible); nonvar, rderef (recursively dereferenced); ground, nonvar+rderef, uninit (uninitialized); ground+rderef; impossible (the empty set of values, unreachable argument).]

Figure 4.4 - The argument lattice


The argument lattice of the entry and exit types in the current analyzer is shown in Figure 4.4. In this lattice, any (the top element) denotes the set of all values, impossible (the bottom element) denotes the empty set (i.e. this predicate is unreachable during execution), uninit denotes the set of uninitialized variables (unbound variables that are not aliased; see Chapter 2), ground denotes the set of values that are ground (i.e. the term contains no unbound variables), nonvar denotes the set of nonvariables, rderef denotes the set of values that are recursively dereferenced (i.e. the term is dereferenced, which means that it is accessible without any pointer chasing, and if it is compound then all its arguments are recursively dereferenced), and ground+rderef denotes the set of values that are both ground and recursively dereferenced.
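As an illustration, the argument lattice can be written down as a covering relation and the lub computed from it. This is a sketch, not the compiler's code: the table `ABOVE` is an assumption that encodes the ordering implied by the set inclusions described in the text (for example, every ground value is a nonvariable, and every uninitialized variable is recursively dereferenced), and the function names are hypothetical.

```python
# For each lattice element, the elements immediately above it.
# Assumed ordering, derived from the set inclusions among the types.
ABOVE = {
    "impossible": ["ground+rderef", "uninit"],
    "ground+rderef": ["ground", "nonvar+rderef"],
    "uninit": ["rderef"],
    "ground": ["nonvar"],
    "nonvar+rderef": ["nonvar", "rderef"],
    "nonvar": ["any"],
    "rderef": ["any"],
    "any": [],
}


def upper_set(x):
    """All elements greater than or equal to x."""
    seen = {x}
    stack = [x]
    while stack:
        for y in ABOVE[stack.pop()]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen


def lub(a, b):
    """Least upper bound: the common upper bound below all other common ones."""
    common = upper_set(a) & upper_set(b)
    for c in common:
        if common <= upper_set(c):
            return c
    raise ValueError("no least upper bound; the order is not a lattice")


def depth(x="impossible"):
    """Length of the longest upward chain starting at x."""
    return max((1 + depth(y) for y in ABOVE[x]), default=0)
```

With this ordering, merging a ground argument with an uninitialized one gives `any`, while merging `ground+rderef` with `uninit` keeps the `rderef` information. The longest chain from `impossible` to `any` has four steps.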

5.2.2. An example of generating an uninitialized variable type

This section gives a simple example of the generation of uninitialized variable types to give an idea of what abstract interpretation does and to illustrate the argument lattice. Uninitialized variables are generated whenever the analyzer deduces that an unbound variable cannot be aliased to another. For example, consider the following program fragment:

    pred(...) :- goal(Z), ...

    goal(X) :- X=s(Y), goal(Y).

If Z is the first occurrence of that variable in the pred(...) clause then it is considered a candidate uninitialized variable. This is possible because it is certainly not aliased to any other variable. In the definition of goal(X), if X is uninitialized then the argument Y of the structure s(Y) may be considered uninitialized as well. This Y is passed on as an argument to goal(Y). Therefore both calls of goal(X) are with an uninitialized argument, so it is consistent to give the argument X an uninitialized variable type.

It may happen that elsewhere in the program there is a call of goal(X) where X is not uninitialized (for example it may be a compound term, or it may be aliased). In that case, the assumption that X is uninitialized is invalidated. This may invalidate assumptions about other arguments of other predicates, so it is necessary to propagate this information. For correctness, it is necessary to iterate until the least fixpoint is reached. At that point symbolic execution of the program does not change any of the derived types.
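The propagation described above can be sketched as a small worklist iteration. The sketch makes several simplifying assumptions that are not part of the analyzer: each predicate has a single argument, the lattice is the three-element chain impossible < uninit < any, and a call site either passes a fixed type or passes on the caller's own entry type (marked `"same"`). All names are hypothetical.

```python
RANK = {"impossible": 0, "uninit": 1, "any": 2}


def lub(a, b):
    """On a chain the least upper bound is simply the larger element."""
    return a if RANK[a] >= RANK[b] else b


def entry_types(call_sites, roots):
    """Propagate entry types along call sites until nothing changes.

    call_sites maps a caller to a list of (callee, passed) pairs, where
    passed is a lattice value or "same" (the caller's own entry type).
    """
    entry = {}
    worklist = list(roots)
    while worklist:
        changed = []
        for caller in worklist:
            for callee, passed in call_sites.get(caller, []):
                if passed == "same":
                    passed = entry.get(caller, "impossible")
                old = entry.get(callee, "impossible")
                new = lub(old, passed)
                if new != old:
                    # The type got worse: the callee must be revisited.
                    entry[callee] = new
                    changed.append(callee)
        worklist = changed
    return entry


# The fragment above: pred calls goal with a fresh variable, and goal
# calls itself with an argument of the same type as its own.
fragment = {"pred": [("goal", "uninit")], "goal": [("goal", "same")]}

# The same fragment plus another caller that passes an arbitrary term.
with_other = dict(fragment, other=[("goal", "any")])
```

For `fragment` the entry type of `goal` stays `uninit`; once the extra caller in `with_other` is added, the assumption is invalidated and the entry type becomes `any`. The loop terminates because every change moves a type strictly up a finite chain.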

5.2.3. Properties of the lattice elements

The example given above already gives an inkling of the relevant properties of ground, uninitialized, and recursively dereferenced variables that simplify the analysis. Here is a more complete list of these properties:

•  The property of being ground, uninitialized, or recursively dereferenced propagates through explicit unifications. The propagation is bidirectional:

   (1)  If X is ground, uninitialized, or recursively dereferenced, then after executing an explicit unification with a compound term (e.g. X=s(A,B)), all of its variables (e.g. A and B) are ground, all of the new variables (e.g. A and B) are uninitialized, or all of the new variables are recursively dereferenced, respectively.




   (2)  In the other direction, if all the variables in the compound term are ground, then X is ground. If all the variables are recursively dereferenced, then X is recursively dereferenced if it was previously uninitialized.

•  The property of being ground is independent of aliasing. For example, if X is ground, then it remains ground after executing the unification X=Y. This is not true of recursively dereferenced or uninitialized variables.

•  An uninitialized variable is not aliased to any other variable. Lattice calculations for uninitialized variables do not affect each other.

5.3. Implementation of the analysis algorithm


Previous sections have introduced the ideas underlying the algorithm, the program lattice used by the algorithm, an example of how types are derived, and the properties of the lattice elements. This section gives a more complete explanation of the algorithm. The presentation starts with an overview of the data representation. It then describes the algorithm, and finally it gives a detailed example of analysis.


Table 4.5 - The components of the variable set VS

Name   Description

S      The set of variables encountered so far in the clause. This set
       is important because any variable encountered in a goal that is
       not in this set is known not to be aliased to any other, i.e. it
       is a new variable, and therefore it is both uninitialized and
       dereferenced.

G      The set of variables that are ground. These variables are bound
       to terms that contain no unbound variables.

N      The set of variables bound to a nonvariable term. This set is a
       superset of G.

U      The set of variables that are uninitialized. A variable becomes
       uninitialized if it is unbound and known not to be aliased to
       any other variable. The symbolic execution enforces this
       constraint. This set is disjoint with N.

D      The set of variables that are recursively dereferenced. A
       variable is recursively dereferenced if it is bound to a term
       that is dereferenced, i.e. it is accessible without any pointer
       chasing, and if it is compound then all its arguments are
       recursively dereferenced. This set is a superset of U.

5.3.1. Data representation


During analysis the types are represented in two ways:

(1)  As lattice elements. For each predicate, there are two structures containing a lattice element in each argument. These structures represent the entry and exit types of the predicate. For example, the predicate concat/3 has two structures which could have the values:

         entry:  concat(any, ground, uninit)
         exit:   concat(ground, ground, any)

     This says that upon entering concat/3 the second argument is ground and the third argument is uninitialized. When the predicate is exited the first two arguments are ground.

(2)  As sets of variables. Type information can also be stored as a set for each type that contains the variables of that type.

These two different representations each have their advantages. The lattice representation makes it easy to calculate the lub (least upper bound). The variable set representation makes it easy to symbolically execute a clause, i.e. to propagate and update information about variables' types through the clause. Functions are provided to convert between the two representations (Figure 4.7). For the lattice in Figure 4.4, there are five sets of variables which are updated during the symbolic execution of a predicate. Conceptually they are part of a 5-tuple VS = (S, G, N, U, D) that holds the current type information (Table 4.5).
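A minimal sketch of the variable set representation follows. The class name `VarSet`, the function `is_consistent`, and the use of strings for variables are assumptions made for illustration; the three checks are the set relations stated in Table 4.5.

```python
from typing import FrozenSet, NamedTuple


class VarSet(NamedTuple):
    """The 5-tuple VS = (S, G, N, U, D) of Table 4.5."""
    S: FrozenSet[str]  # variables encountered so far
    G: FrozenSet[str]  # ground variables
    N: FrozenSet[str]  # variables bound to a nonvariable term
    U: FrozenSet[str]  # uninitialized variables
    D: FrozenSet[str]  # recursively dereferenced variables


def is_consistent(vs):
    """Check the relations between the components stated in Table 4.5."""
    return (vs.G <= vs.N              # N is a superset of G
            and not (vs.U & vs.N)     # U is disjoint with N
            and vs.U <= vs.D)         # D is a superset of U


# One possible variable set for the entry type
# concat(any, ground, uninit) with head arguments A, B, C.
entry = VarSet(S=frozenset("ABC"), G=frozenset("B"), N=frozenset("B"),
               U=frozenset("C"), D=frozenset("C"))

# Marking C as nonvariable while it is still in U breaks the invariants.
broken = entry._replace(N=frozenset("BC"))
```

Immutable sets are used so that each clause can start from the same entry variable set without copying; this is a design choice of the sketch, not something the text prescribes.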

5.3.2. Evolution of the analyzer

The current analyzer was preceded by three simpler versions. The lattice of the first analyzer represented only entry types and had three elements: impossible, uninit, and any. The second analyzer added the ground type in the entry lattice and an exit lattice of the same structure. The third analyzer added the rderef type to these lattices. The current (fourth) analyzer added the nonvar type. Despite not using a representation for variable aliasing, the third and fourth analyzers are able to derive many nontrivial rderef and nonvar types. The added types are independent, i.e. each version of the analyzer does no better than previous versions on types that previous versions also derive.

The choice of what lattice types to add was done by inspecting the compiled code of programs and by deciding what types were easy to derive in the context of the structure of the existing analyzers. Types were added that are present in many programs. Measurements show that having an exit lattice and doing back propagation (see below) are essential features to derive good ground, rderef, and nonvar types. A numerical evaluation of the efficiency of the analysis (the percentage of arguments for which types are derived) and the effect of analysis on execution time and code size is given in Chapter 7.

For the next version of the analyzer the added types rlist (recursive list, i.e. the term is either nil or a cons cell whose tail is a recursive list), integer, and ((nonvar+deref) or uninit) (the term is either a dereferenced nonvariable or uninitialized) are contemplated.


    type varset = (set, set, set, set, set);   /* 5-tuple */

    var Program : set of predicate;
        L_entry : mapping predicate → lattice;
        L_exit  : mapping predicate → lattice;
        P       : predicate;

    procedure analysis;
    var E  : set of predicate;
        VS : varset;
    begin
        E := { P | arity(P) = 0 } ∪ { declared entry points };
        Initialize L_entry with the types of the declared entry points;
        Initialize L_exit to impossible for all predicate arguments;
        while E ≠ ∅ do begin
            for each predicate P ∈ E do begin
                VS := lattice_to_varset(L_entry[P], P);
                VS := update_exit(VS, predicate_analyze(P, VS), P)
            end;
            E := { P | L_entry[P] has changed or ∃ G ∈ P | L_exit[G] has changed }
        end
    end;

Figure 4.5 - The analysis algorithm: top level


5.3.3. The analysis algorithm

The analysis algorithm is presented at three levels of detail. An English-language description is given of the basic ideas. A detailed pseudocode definition (Figures 4.5 through 4.7) describes the complete algorithm at a high level of abstraction. Appendix G gives the implementation in the compiler.

The algorithm maintains entry and exit lattice elements for each predicate argument in the program. Analysis proceeds by traversing the call graph starting from a set of entry points that have known types. The entry points include all predicates of arity 0 and any entry declarations given by the programmer (Appendix A). The traversal is repeated until there are no more changes in the lattice values, that is, until a fixpoint is reached. With suitable conditions (i.e. all type updating is monotonic and types are propagated correctly) this fixpoint is the least fixpoint and the resulting types give accurate information about the original program. When a goal is encountered during a traversal three things are done: (1) the goal's entry lattice type is updated using the current value of VS, (2) if the goal's definition is part of the program then the definition is entered, and (3) upon return, the new value of VS is used to update the goal's exit lattice type. A correct value of VS is maintained at all times during the traversal of a goal's definition.

    function predicate_analyze(P : predicate; VS : varset) : varset;
    var F    : formula;
        VS_i : array [1 .. n] of varset;
        G_i  : goal;
        i, j : integer;
    begin
        /* At this point P = (C_1, ..., C_n) (list of n clauses) */
        for each non-active clause C_i ∈ P do begin   /* Symbolic execution of clause C_i */
            /* At this point C_i = (G_i1, ..., G_in_i) (conjunction of n_i goals) */
            VS_i := VS;
            for j := 1 to n_i do begin   /* Symbolic execution of goal G_ij */
                if (G_ij is a unification) then begin
                    VS_i := symbolic_unify(VS_i, G_ij)   /* Figure 4.8 */
                end else if G_ij ∈ Program then begin   /* G_ij is defined in the program */
                    L_entry[G_ij] := lub(L_entry[G_ij], varset_to_lattice(VS_i, G_ij));
                    if non-exponentiality constraint then begin
                        VS_i := update_exit(VS_i, predicate_analyze(G_ij, VS_i), G_ij)
                    end
                end else begin   /* G_ij is not defined in the program */
                    F := varset_to_type(VS_i, G_ij);
                    G_i := entry_specialize(G_ij, F);
                    VS_i := update_exit(VS_i, exit_varset(G_i), G_ij)
                end
            end;
            VS_i := back_propagate(VS_i, C_i)   /* To obtain more precision */
        end;
        return ∩ (i = 1 .. n) VS_i   /* Merge the exit values of all VS_i */
    end;

Figure 4.6 - The analysis algorithm: analyzing a predicate

The definition of the algorithm in Figures 4.5 through 4.7 leaves out some details but is a faithful description of the analysis. The two conditions non-active and non-exponentiality are explained in the next section. The following sections describe what happens in symbolic execution of a predicate (including back propagation) and symbolic execution of a goal.




    function update_exit(VS_1, VS_2 : varset; G : goal) : varset;
    var VS : varset;
    begin
        /* Calculate new VS from old VS_1 and exit VS_2 */
        VS.nonvar := VS_1.nonvar ∪ VS_2.nonvar;
        VS.ground := VS_1.ground ∪ VS_2.ground;
        VS.rderef := (VS_1.rderef ∩ VS_1.ground) ∪ VS_2.rderef;
        VS.sofar  := VS_1.sofar ∪ vars(G);
        VS.uninit := VS_1.uninit − vars(G);
        /* Calculate new exit lattice */
        L_exit[G] := lub(L_exit[G], varset_to_lattice(VS, G));
        return VS
    end;

    function lub(L_1, L_2 : lattice) : lattice;
        return (least upper bound of L_1 and L_2);

    function lattice_to_varset(L : lattice; G : goal) : varset;
        return (varset corresponding to L using variables of G);

    function varset_to_lattice(VS : varset; G : goal) : lattice;
        return (lattice corresponding to VS using variables of G);

    function back_propagate(VS : varset; C : clause) : varset;
        return (improved exit varset from VS using unification goals of C);

    function varset_to_type(VS : varset; G : goal) : formula;
        return (type formula corresponding to VS using variables of G);

    function entry_specialize(G : goal; F : formula) : goal;
        return (specialized entry point of G when called with type F);

    function exit_varset(G : goal) : varset;
        return (exit varset stored for the known goal G);

Figure 4.7 - Utility functions needed in the analysis algorithm


5.3.4. Execution time of analysis

This section shows that the average analysis time for programs that contain only linearly recursive predicates (i.e. no clauses contain more than one recursive call) and that have bounded arity is proportional to the size of the program. The analysis time $T_{analysis}$ is proportional to the time of each iteration $T_{iter}$ and the number of iterations $N_{iter}$ needed to reach the least fixpoint:

$$T_{analysis} = \Theta(T_{iter} \cdot N_{iter})$$

For programs that contain only linearly recursive predicates, the time of each iteration is:

$$T_{iter} = \Theta(S \cdot A)$$

where $S$ is the total number of goals in the program and $A$ is the maximum number of times a predicate is traversed. (Programs with non-linearly recursive predicates are discussed below.) This is true because the algorithm traverses each clause at most once in an iteration. It assumes that the symbolic execution of a goal whose definition is not traversed is a constant time operation. A predicate is traversed only if the current entry type is worse than the previous worst entry type. The number of times this situation can occur is bounded by the depth of the entry lattice of the predicate, which is proportional to the maximal arity in the program. Therefore:

$$S = \sum_{i=1}^{n} \sum_{j=1}^{n_i} \mathit{length}(C_{ij})$$

$$A = O\!\left(\max_{i=1}^{n} \mathit{arity}(P_i)\right)$$

where the program contains $n$ predicates, and each predicate $P_i$ contains $n_i$ clauses $C_{ij}$. The arity of a predicate is denoted by $\mathit{arity}(P_i)$ and the number of goals in a clause is denoted by $\mathit{length}(C_{ij})$. The number of iterations is trivially bounded by the depth of the program lattice:

$$N_{iter} = O(D_{total})$$

where $D_{total}$ is given by:

$$D_{total} = 2 \cdot 4 \cdot \sum_{i=1}^{n} \mathit{arity}(P_i)$$

In this equation, 2 counts the entry and exit lattices, $\mathit{arity}(P_i)$ is the number of arguments in the predicate lattice, and 4 is the depth of each argument lattice. This bound on $N_{iter}$ is wildly pessimistic. For most real programs $N_{iter}$ is bounded by a small constant. All the benchmark programs satisfy $N_{iter} \le 7$ (Chapter 7). However, there exist pathological predicates $P_k$ for which $N_{iter} = \Theta(\mathit{arity}(P_k))$. For example, consider the program:

main  »  19.  . 

a  ( 0. . 

a(N,A,B,C,D,E,F,G,H,I,J) :- N1 is N-1, a(N1,B,C,D,E,F,G,H,I,J,A).

The analyzer requires 10 passes to determine that all arguments of a/11 are ground and dereferenced upon exit.

To summarize these results, the worst-case and average-case total execution times of analysis for programs without non-linearly recursive predicates are:

$$T_{analysis,worst} = O(A \cdot S \cdot D_{total})$$

$$T_{analysis,ave} = O(A \cdot S)$$

If the arity is bounded, then the average execution time of analysis is proportional to the program's size.
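To make the quantities concrete, the following sketch computes $S$, the maximal arity that bounds $A$, and $D_{total}$ for a program summary. The function name and the input format are hypothetical. The sample input assumes that the rotating a/11 example consists of a main/0 clause with one goal, a base clause of a/11 with no goals, and a recursive clause with two goals; the first two are assumptions about clauses that are not fully legible in the source.

```python
def analysis_bounds(program, lattice_depth=4):
    """Return (S, max_arity, D_total) for a program summary.

    program maps a predicate name to (arity, [goals in each clause]).
    lattice_depth is the depth of one argument lattice (4 in the text).
    """
    # S: total number of goals, summed over all clauses of all predicates.
    S = sum(sum(lengths) for _, lengths in program.values())
    # The maximal arity, to which A is proportional.
    max_arity = max(arity for arity, _ in program.values())
    # D_total: entry and exit lattices (factor 2), one argument lattice
    # of the given depth per predicate argument.
    D_total = 2 * lattice_depth * sum(arity for arity, _ in program.values())
    return S, max_arity, D_total


# Assumed shape of the example: main/0 with one goal, a/11 with a
# base clause (no goals) and a recursive clause (two goals).
example = {"main": (0, [1]), "a": (11, [0, 2])}
bounds = analysis_bounds(example)
```

For this input the bound on the number of iterations is 2 · 4 · 11 = 88, far above the 10 passes reported for the example, which illustrates how pessimistic the bound is.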

For programs that contain non-linearly recursive predicates this result needs to be amended. There is a trade-off between precision and execution time of the analysis. If not enough predicates are traversed then analysis information is lost. If too many predicates are traversed then analysis time becomes too long. Two constraints are used to prune the traversal of the call graph:

(1)  The non-active constraint. A clause that is in the process of being traversed is called an active clause. During recursive calls of predicate_analyze, the algorithm maintains a set of the active clauses and will not traverse an active clause twice.

(2)  The non-exponentiality constraint. Traverse a predicate (i.e. call predicate_analyze) only if one of two conditions holds: (a) the entry type has changed since the last traversal of the predicate, or (b) at least one of the predicate's clauses is active.

Condition (a) is understandable: it is needed to ensure that an updated type is propagated correctly. The rationale for condition (b) is more subtle. If it did not hold, then the exit types derived by the analysis would be significantly worse because the base case of a recursive predicate may not be reached during the traversal. Running the analyzer both with and without this condition shows this to be true for most programs.
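The non-active constraint alone can be sketched as a recursive traversal that carries a set of active clauses. This is an illustration, not the analyzer: clauses are reduced to lists of the predicates they call, the non-exponentiality constraint is left out, and the names `traverse` and `clauses` are hypothetical.

```python
def traverse(pred, clauses, active, trace):
    """Visit the clauses of pred depth-first, skipping active clauses.

    clauses maps a predicate to a list of clause bodies; each body is
    the list of predicates called in that clause, in left-to-right order.
    """
    for index, body in enumerate(clauses.get(pred, [])):
        key = (pred, index)
        if key in active:
            continue  # non-active constraint: never re-enter an active clause
        active.add(key)
        trace.append(key)
        for goal in body:
            traverse(goal, clauses, active, trace)
        active.discard(key)


# A mutually recursive pair: p has a base clause and a clause calling q,
# and q calls p again.
clauses = {"main": [["p"]], "p": [[], ["q"]], "q": [["p"]]}
trace = []
traverse("main", clauses, set(), trace)
```

The traversal terminates even though p and q call each other. When p is re-entered from q, its recursive clause is active and is skipped, but its base clause is visited again, which is what lets the base case contribute to the exit types.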

The problem with condition (b) is that it leads to an analysis time that is exponential in the number of non-linearly recursive clauses in a predicate. For many programs this is not serious. However, it occurs often enough that it should be solved. One of the benchmark programs, the nand benchmark, has this problem. A better condition is needed to replace condition (b). It must (1) ensure that the base case of all recursive predicates is reached (for good exit types), and (2) not result in time exponential in the number of non-linearly recursive predicates.

5.3.5. Symbolic execution of a predicate

The heart of the dataflow analysis algorithm is the symbolic execution of a predicate (Figure 4.6). Each clause of the predicate is traversed from left to right. During the traversal the type information is kept in the variable set VS. Symbolic execution of the predicate consists of four steps:

(1)  For each clause of the predicate, translate the lattice entry type of the predicate into the variable set VS, and start traversing the clause.

(2)  Symbolically execute each goal in the clause and update VS.

(3)  At the end of each clause, back propagation improves VS by deducing information that only becomes available at the end of the clause. For example, consider the clause:

         a(X) :- X=[Y|L], b(Y, L).

     If both Y and L are in the ground set G of VS at the end of the clause then this is also true of X because of the unification X=[Y|L]. Back propagation is used to improve the exit types for ground, recursive dereference, and nonvariable types. Measurements show that it is a necessary step to get good exit types.

(4)  At the end of the predicate, combine the variable sets of all clauses by intersecting their corresponding components. Convert the result back to the lattice representation and update the exit type for the predicate.
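Steps (3) and (4) can be sketched for the ground component alone. This is a simplified illustration under stated assumptions: unifications are given as pairs of a variable and the variables of the term it is unified with, the loop to a fixpoint over the unifications is a choice of this sketch rather than something the text specifies, and the function names are hypothetical.

```python
def back_propagate_ground(ground, unifications):
    """Add X to the ground set when X=T and every variable of T is ground.

    unifications is a list of (X, variables of T) pairs for one clause.
    The loop repeats so that a newly ground X can make another
    unification's left-hand side ground as well.
    """
    ground = set(ground)
    changed = True
    while changed:
        changed = False
        for x, term_vars in unifications:
            if x not in ground and set(term_vars) <= ground:
                ground.add(x)
                changed = True
    return ground


def merge_clauses(ground_sets):
    """Combine per-clause results by intersection, as in step (4)."""
    return set.intersection(*map(set, ground_sets))


# The clause a(X) :- X=[Y|L], b(Y, L), assuming Y and L are ground
# after the call to b/2.
after_clause = back_propagate_ground({"Y", "L"}, [("X", ["Y", "L"])])
```

Intersection is the conservative choice for merging: a variable is reported ground on exit only if every clause establishes it.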

5.3.6. Symbolic execution of a goal

Symbolic execution of a goal is done in three ways, depending on whether the goal is a unification, the goal is defined in the program, or the goal is not defined in the program.

5.3.6.1. Unification goals

Symbolic execution of unification is defined by the function symbolic_unify(VS, X=T) in Figure 4.8, which converts VS = (S, G, N, U, D) into VS' = (S', G', N', U', D'). These equations use the utility functions of Table 4.6. For each component of VS, any equation in Figure 4.8 with a true condition can be used. In practice, if more than one condition is satisfied, an equation giving more information (i.e. the resulting set is larger) is used first. These equations are listed first. For example, the first equation of D' gives a larger set, so it is preferred over the others. If both X and T are variables, then the algorithm switches X and T to see if one of the more desirable equations is satisfied before attempting one of the lesser equations.

Table 4.6 - Utility functions of a term T

Notation                                   Definition

$\mathit{vars}(T)$                         The set of variables in the term T.

$\mathit{dups}(T)$                         The set of variables that occur at least twice in the term T.

$\mathit{new}(T) = \mathit{vars}(T) - S$   The set of all variables in T that have not occurred before.

$\mathit{old}(T) = \mathit{vars}(T) \cap S$
                                           The set of all variables in T that have occurred before.

$\mathit{deref}(T) = \mathit{vars}(T) - (S - U)$
                                           The set of all variables in T that are candidates to be
                                           recursively dereferenced. This is the same as
                                           $\mathit{new}(T) \cup (\mathit{vars}(T) \cap U)$, i.e.
                                           $\mathit{new}(T)$ supplemented with the variables in T
                                           that are uninitialized.

$$S' = S \cup \mathit{vars}(X{=}T)$$

$$G' = \begin{cases} G \cup \mathit{vars}(T) & \text{if } X \in G \\ G & \text{otherwise} \end{cases}$$

$$N' = \begin{cases} N \cup G' \cup \{X\} & \text{if } \mathit{nonvar}(T) \text{ or } (\mathit{var}(T) \text{ and } T \in N) \\ N \cup G' & \text{otherwise} \end{cases}$$

$$U' = \begin{cases} U \cup \mathit{new}(T) - \mathit{old}(T) - \{X\} & \text{if } (X \notin S \text{ or } X \in U) \\ U - \mathit{vars}(X{=}T) & \text{otherwise} \end{cases}$$

$$D' = \begin{cases} D \cup \mathit{deref}(T) \cup \{X\} & \text{if } (X \notin S \text{ or } X \in U) \text{ and } \mathit{old}(T) \subseteq (D \cup U) \\ D \cup \mathit{deref}(T) & \text{if } (X \notin S \text{ or } X \in U) \text{ or } X \in (D \cap G) \\ D \cup \mathit{deref}(T) - \{X\} & \text{if } \mathit{dups}(T) = \emptyset \text{ and } X \in D \text{ and } \mathit{old}(T) \subseteq U \\ D \cap G & \text{otherwise} \end{cases}$$

Figure 4.8 - Symbolic unification VS' := symbolic_unify(VS, X=T)

Table 4.7 - Conditions for the lattice entry type

Name             Condition

$C_{ground}$     $\mathit{vars}(X) \subseteq G$

$C_{var}$        $\mathit{var}(X)$

$C_{old}$        $X \in (\mathit{dups}(P) \cup S - U)$

$C_{rderef}$     $(\mathit{vars}(X) \cap S) \subseteq D$

$C_{nonvar}$     $X \in N$

Table 4.8 - Calculation of the lattice entry type

C_ground   C_var   C_old   C_rderef   C_nonvar   Lattice value

yes        -       -       yes        -          ground+rderef
yes        -       -       no         -          ground
no         no      -       yes        -          nonvar+rderef
no         no      -       no         -          nonvar
no         yes     no      -          -          uninit
no         yes     yes     yes        yes        nonvar+rderef
no         yes     yes     yes        no         rderef
no         yes     yes     no         yes        nonvar
no         yes     yes     no         no         any

5.3.6.2. Goals defined in the program


Symbolic execution of a goal with a definition is done by symbolically executing the definition. Information is kept about the part of the call graph that has already been traversed, so that analysis will not go into an infinite loop. The function varset_to_lattice(VS, P) is defined by Tables 4.7 and 4.8. For each argument X of P, first determine the values of the five conditions in Table 4.7. Then use these conditions to look up the lattice value for the argument in Table 4.8.
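The lookup in Table 4.8 amounts to a small decision procedure. The sketch below is one reading of that table, assuming that its five condition columns follow the order of the conditions in Table 4.7 (ground, var, old, rderef, nonvar); the function name and parameter names are hypothetical labels for those conditions.

```python
def lattice_value(c_ground, c_var, c_old, c_rderef, c_nonvar):
    """Look up the lattice entry type from the five conditions of Table 4.7.

    Entries marked "-" in the table are conditions that are not consulted
    on the corresponding branch.
    """
    if c_ground:
        return "ground+rderef" if c_rderef else "ground"
    if not c_var:
        # The argument is syntactically a nonvariable.
        return "nonvar+rderef" if c_rderef else "nonvar"
    if not c_old:
        # A variable that is new or still uninitialized.
        return "uninit"
    if c_rderef:
        return "nonvar+rderef" if c_nonvar else "rderef"
    return "nonvar" if c_nonvar else "any"
```

For example, an argument that is an old variable, neither known to be dereferenced nor bound to a nonvariable, yields `any`, the worst-case type.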


5.3.6.3. Goals not defined in the program

Examples of goals that are not defined in the program being analyzed are built-ins and library predicates. Symbolic execution of these goals is done in two parts. First, entry specialization replaces the goal by a faster entry (section 5.4.1). Second, the type declarations that the programmer has given for the entry are used to continue the analysis. If there are none, then worst-case assumptions are made.


5.3.7. An example of analysis


The following program is interesting because it is mutually recursive:

    main :- incl_2([A,B], C, D).

    incl_2([], C, [C]).
    incl_2([A|E], C, D) :- incl_3(C, A, E, D).

    incl_3(C, A, E, [A|D]) :- incl_2(E, C, D).

Table 4.9 - Analysis of an example program

                       incl_2(A,B,C)                          incl_3(A,B,C,D)
                       A           B           C              A           B           C           D

Start          entry   impossible  impossible  impossible     impossible  impossible  impossible  impossible
               exit    impossible  impossible  impossible     impossible  impossible  impossible  impossible

After pass 1   entry   rderef      uninit      uninit         uninit      rderef      rderef      uninit
               exit    nonvar      rderef      nonvar         rderef      any         ground      nonvar

After pass 2   entry   rderef      uninit      uninit         uninit      rderef      rderef      uninit
               exit    nonvar      rderef      nonvar         rderef      any         nonvar      nonvar

After pass 3   entry   rderef      uninit      uninit         uninit      rderef      rderef      uninit
               exit    nonvar      rderef      nonvar         rderef      any         nonvar      nonvar

The predicates incl_2/3 and incl_3/4 are extracted from a definition of set inclusion. Three
analysis passes are necessary to reach the fixpoint (Table 4.9). The entries that have changed with respect
to the previous pass are in italics. The final types are given in Table 4.10. Most of the correct types are
determined after the first pass. A single exit type of incl_3/4 is corrected in the second pass. This is
necessary because the third argument of incl_3/4 is the same as the first argument of incl_2/3.


Table 4.10 - Final results of analysis

incl_2(A, B, C)
    entry type:  rderef(A), uninit(B), uninit(C)
    exit type:   nonvar(A), rderef(B), nonvar(C)

incl_3(A, B, C, D)
    entry type:  uninit(A), rderef(B), rderef(C), uninit(D)
    exit type:   rderef(A), nonvar(C), nonvar(D)


5.4.  Integrating  analysis  into  the  compiler 


Deriving type information is only the beginning. The analyzer must be integrated into the compiler
to take advantage of the information. The dataflow analysis module itself does four source
transformations (Figure 4.9) before passing the result to the next stage, which does determinism extraction.

[Figure 4.9 - Integrating analysis into the compiler. Labels in the diagram, in order: kernel Prolog and
entry declarations (inputs); entry specialization (replace goals); specialized entries and derived types;
kernel Prolog with specialized entries; uninitialized register conversion; derived types with uninitialized
register modes; head unraveling; kernel Prolog with unraveled heads; kernel Prolog (output).]

The following source transformations are done in the dataflow analysis module:

(1) Entry specialization. Determine a fast entry point for each occurrence of a call whose definition is
not in the program being analyzed, and continue analysis with this entry point.

(2) Uninitialized register conversion. Convert uninitialized memory types to uninitialized register
types when this results in a speedup. The conversion is done when an argument can be returned in a
register without giving up last call optimization.


(3) Head unraveling. Unravel the heads of all clauses again in the light of the derived type information.
For example, the head a(A, A, A) can be unraveled in three different ways, namely
(a(A,B,C) :- A=B, C=B) or (a(A,B,C) :- A=C, B=C) or (a(A,B,C) :- B=A, C=A). If
both A and B are nonvariables and C is unbound, then the first or third possibilities allow the
compiler to do argument selection. Unraveling is already done during the conversion to kernel Prolog,
but it must be done again after dataflow analysis since the new types may allow it to be done better.

(4)  Type  updating.  Supplement  the  type  declarations  given  by  the  programmer  (if  any)  by  the  derived 
types.  All  inconsistencies  are  reported  and  compilation  continues  with  the  corrected  types. 

The first three of these transformations are discussed in more detail in the following sections.

5.4.1.  Entry  specialization 

During analysis, a fast entry point is determined for each call whose definition is not in the program
being analyzed (i.e. each dangling call). For example, the call sort(A, B) is replaced by the entry point
'$sort *2'(A, B) if B is uninitialized. Analysis continues with the types of the fast entry point. The
program is unchanged until the end of analysis, so the determination of the fast entry point is repeated in
each analysis iteration whenever a dangling call is encountered. This mechanism is intended to speed up
execution of built-in predicates and library routines, but it is also available to the programmer.

The fast entry point is determined by calculating the type formula corresponding to the variable set
VS with the function varset_to_type(VS, G) (Figures 4.6 and 4.7). This type formula is used to traverse
the modal entry tree for the goal. The modal entry tree is a data structure that contains a set of entry points
and the types that each requires (Appendix A). Entry specialization is also done in the clause compiler,
and a detailed example of the use of the modal entry tree is given in Chapter 5 (section 3.4).
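
As a rough illustration of entry specialization, the following Python sketch picks a fast entry point for a dangling call from the types known at the call site. The table `ENTRY_TABLE`, the function `specialize`, and the flat list-of-alternatives layout are assumptions made for this example only; the compiler's actual modal entry tree is the structure described in Appendix A.

```python
from typing import Dict, List, Tuple

# Hypothetical stand-in for a modal entry tree: for each predicate, a list of
# (required argument types, entry point) pairs, most specialized first.
ENTRY_TABLE: Dict[str, List[Tuple[Dict[int, str], str]]] = {
    "sort/2": [
        ({2: "uninit"}, "$sort *2"),  # fast entry when argument 2 is uninitialized
        ({}, "sort"),                 # general entry, no type requirements
    ],
}


def specialize(pred: str, known: Dict[int, str]) -> str:
    """Return the first entry whose required argument types are all known to hold."""
    for required, entry in ENTRY_TABLE.get(pred, []):
        if all(known.get(arg) == ty for arg, ty in required.items()):
            return entry
    # No table entry for this predicate: keep the original call name.
    return pred.split("/")[0]
```

Because the program itself is unchanged until analysis ends, a lookup of this kind would be repeated each time the dangling call is reached in an analysis iteration.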

5.4.2.  Uninitialized  register  conversion 

Often an uninitialized memory type can be converted to an uninitialized register type. The compiler
uses four conditions to guide the conversion process. Define a survive goal as one that does not alter any
temporary registers (except for arguments with uninitialized register type, which are outputs). A goal that
potentially alters temporary registers is a non-survive goal. The compiler maintains a table of survive
goals. With these definitions the four conditions for a predicate P are:

(1) All arguments of P with uninitialized memory type are candidates to be converted.

(2) A candidate argument of P must occur at most once in the body of each clause of P. In each clause
where it occurs, the argument must be in the last non-survive goal or any survive goal beyond it.

(3) For each clause of P, if the last goal G is a non-survive goal, then the candidate argument of P must
be in the same argument position in G as in the head of P. This is necessary to avoid losing the
opportunity for last call optimization (LCO); if the argument positions are different then a move
instruction is needed between the last call and the return. If the last goal is a survive goal then the
condition is unnecessary because it is not as important to retain LCO; a survive goal can never be
mutually recursive with the predicate it is part of.

(4) Often the last goal G has candidate arguments that are not candidate arguments of P, so they have to
be initialized when returning from G. This has two disadvantages: P loses LCO and P must
allocate an environment (which may not exist otherwise). The solution to this problem involves a
trade-off: is it better to have LCO in P and fewer uninitialized register arguments in G, or to have no LCO
in P and more uninitialized register arguments in G? The compiler recognizes a class of predicates
G for which the first is true. Define a fast predicate as one whose definition contains only built-ins
and survive goals. If G is fast then reduce the set of G's candidate arguments to include only those
that are candidate arguments of P.

A transitive closure is done until all four conditions are satisfied. These conditions can be relaxed slightly
in several ways. However, even with the existing conditions it is possible to convert about one third of all
uninitialized types into uninitialized register types (Chapter 7). The third and fourth conditions are not
needed for correctness, but only for execution speed. The third condition ensures that LCO is not lost. The
fourth condition speeds up the chat_parser benchmark by 1% and was added after code inspection
discovered cases where the use of uninitialized registers slows down execution.

5.4.3. Head unraveling

This transformation repeats the head unraveling transformation (Chapter 3) with the information
gained from dataflow analysis. This increases the opportunities for determinism extraction. For example,
before analysis the clause:

    a(X, X, X).

is transformed to the following kernel Prolog by making the head unifications explicit (i.e. "unraveling"
the head unifications):

    a(X, Y, Z) :- X=Y, X=Z.

If analysis derives that X is unbound and both Y and Z are nonvariable, then the above expansion hides the
determinism by twice unifying an unbound variable with a nonvariable. Unraveling the head unifications
again after analysis results in:

    a(X, Y, Z) :- Y=Z, X=Y.

In this version, the nonvariables Y and Z are unified together, better exposing the deterministic check that
is done, and the unbound variable X is only unified once.
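
The ordering idea in this example can be sketched in a few lines of Python. This is an illustrative simplification, not the compiler's procedure: the function `unravel` and its two parameters are hypothetical, and it handles only the case of a head in which all listed argument positions must be equal.

```python
from typing import List, Optional, Set, Tuple


def unravel(args: List[str], nonvar: Set[str]) -> List[Tuple[str, str]]:
    """Order the explicit unifications for a head whose arguments must all be equal.

    args   -- fresh variable names, one per argument position (e.g. X, Y, Z)
    nonvar -- names that analysis has derived to be nonvariable
    Each returned pair (L, R) stands for the goal L=R.
    """
    checks = [a for a in args if a in nonvar]      # operands of deterministic checks
    others = [a for a in args if a not in nonvar]  # unbound or unknown arguments
    goals: List[Tuple[str, str]] = []
    # Unify the nonvariable arguments with each other first: these act as tests.
    for left, right in zip(checks, checks[1:]):
        goals.append((left, right))
    # Then unify every remaining argument exactly once with a representative.
    anchor: Optional[str] = checks[0] if checks else (others[0] if others else None)
    for a in others:
        if a != anchor:
            goals.append((a, anchor))
    return goals
```

With `args = ["X", "Y", "Z"]` and `nonvar = {"Y", "Z"}` the sketch yields the pairs for Y=Z followed by X=Y, the same order as the re-unraveled clause above.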

6.  Determinism  transformation 

This section groups four transformations that expose the determinism inherent in a predicate. The
purpose of the first three transformations is to make the determinism in the predicate easily visible, so that
the fourth transformation, determinism extraction, is as successful as possible in generating case
statements. The following transformations are done in order:

(1) Head-body segmentation. By separating the heads of clauses from the clause bodies, this reduces
the code expansion caused by type enrichment and determinism extraction.

(2) Type enrichment. This adds types to predicates for which global analysis is not able to determine
the type. The compiler creates different versions of the predicate assuming different input types.
This increases code size, but improves performance since often a predicate is deterministic at
run-time even though this could not be detected at compile-time.

(3) Goal reordering. This reorders goals in a clause to expose more determinism. Tests (such as
arithmetic relations) are moved to the left and predicates guaranteed to succeed (such as unifications with
uninitialized variables) are moved to the right.

(4) Determinism extraction with test sets. This transformation converts the predicate into a nested
case statement that makes its determinism explicit, so that a straightforward compilation to BAM
code is possible.

6.1.  Head-body  segmentation 

This transformation reduces the code expansion resulting from enrichment and determinism
extraction. A predicate is split into a new predicate and a set of clause bodies. The new predicate contains only
the goals of the original predicate that are useful for determinism extraction, i.e. all explicit unifications and
tests (including type checking and arithmetic comparisons, see Table 4.11) in each clause starting from the
head up to the first goal that is not in this category. The rest of the clause bodies are separated from the
predicate. This is done to avoid code duplication in determinism extraction, since the same clause may
occur in several leaves of the decision tree.

For example, the predicate:

    p(A,B) :-
        (  var(A), p(A), q(A,C), t(C,D), u(D,B)
        ;  A=b, r(A), s(A)
        ).

is transformed into:

    p(A,B) :-
        (  var(A), '$d1'(A,B)
        ;  A=b, '$d2'(A)
        ).

    '$d1'(A,B) :- p(A), q(A,C), t(C,D), u(D,B).

    '$d2'(A) :- r(A), s(A).

The new predicate consists only of those parts of the original predicate that are useful for extracting
determinism. The determinism extraction is free to create a decision tree from the new predicate without
worrying about duplicating the clause bodies at the leaves of the tree. The separated clause bodies are compiled
once only, and the BAM transformation stage (Chapter 6) merges them with the decision tree, thus creating
a decision graph.


The decision exactly where to split the clause bodies depends on several factors. All goals in the
body are classified into two kinds: goals that are useful for extracting determinism (called "tests"), and
other goals. Then the split follows these rules: (1) Only those tests all of whose variables are in the head
become part of the new predicate. (2) If the length of the clause body is less than a given threshold, then
all of it becomes part of the new predicate.
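
A minimal Python sketch of the two splitting rules follows. The function `segment`, the representation of a goal as a (text, variable set) pair, and the `threshold` default of 0 are assumptions for illustration; the text above does not give the threshold the compiler actually uses.

```python
from typing import Callable, List, Set, Tuple


def segment(head_vars: Set[str],
            body: List[Tuple[str, Set[str]]],
            is_test: Callable[[str], bool],
            threshold: int = 0) -> Tuple[List[str], List[str]]:
    """Split one clause body into (goals kept in the new predicate, separated rest)."""
    names = [goal for goal, _ in body]
    if len(body) < threshold:
        return names, []              # rule (2): short bodies are kept whole
    split = 0
    for goal, goal_vars in body:
        # rule (1): keep leading tests whose variables all occur in the head
        if is_test(goal) and goal_vars <= head_vars:
            split += 1
        else:
            break
    return names[:split], names[split:]
```

Applied to the two alternatives of p(A,B) shown earlier, the sketch keeps var(A) and A=b in the new predicate and separates the remaining goals, which correspond to the bodies of '$d1' and '$d2'.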

Head-body segmentation interacts with type propagation. It often occurs that a clause body is called
from several leaves in the decision tree with different types. In that case, it is compiled with a type that is
the intersection of the types of the entry points. A complication arises when one of the leaves considers a
variable to be uninitialized, and another leaf does not. In that case, the first leaf jumps to a piece of code to
initialize the variable, and only afterwards jumps to the clause body.

6.2.  Type  enrichment 

By looking at the type or the value of one or more arguments it is possible to reduce the set of
clauses that have to be tried. Often the dataflow analysis is able to derive sufficiently strong types so that a
good selection can be done, i.e. a deterministic predicate can be compiled efficiently. However, if the
types given for the predicate are weak then a source transformation is done to enrich them. The
enrichment consists of adding a test to check at run-time whether an argument is a variable or a nonvariable, and
to branch to different copies of the predicate in each case.

The number of arguments that are enriched is given by the argument S of the compiler option
select_limit(S). Define a good predicate argument as one that is an argument of a unification not
known to succeed always, i.e. in the unification neither argument is known to be unbound. An argument is
known to be of a given type if the type is implied by the type formula. Whether or not enrichment is done
is based on the following heuristic:

Enrichment Heuristic 1: If the number of good arguments known to be nonvariable is less
than the selection limit S, then choose the lowest numbered good argument that is not known
to be nonvariable. Otherwise choose only the first argument, if it is a good argument and it is
not known to be nonvariable.

This heuristic is applied recursively on enriched predicates. The default selection limit is S=1.
This default is justified given that (1) a selection limit S=1 already generalizes the first argument selection
of the WAM, and (2) compilation time and object code size increase rapidly with the selection limit. Even
with S=1, the source transformation occasionally results in some duplicate code being generated. This is
removed by the BAM transformation stage. When S=1 the heuristic is simpler:

Enrichment Heuristic 2: If there exist no good arguments known to be nonvariable, then
choose the lowest numbered good argument that is not known to be nonvariable. Otherwise
choose the first argument, if it is a good argument and it is not known to be nonvariable.

This heuristic generalizes the first-argument selection of the WAM, i.e. it always does at least a first
argument selection, but depending on the types that the predicate has (often derived from dataflow analysis) and
the predicate itself (what kinds of head unifications it does), the amount of selection can be vastly greater.
The heuristic may seem complex, but it is a natural way to make a predicate deterministic.
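
Enrichment Heuristic 1 can be written down directly. The Python sketch below is only an illustration: the function `choose_enrichment_arg` and the encoding of each argument as an (is_good, known_nonvar) pair are assumptions, and the sketch performs a single choice rather than the recursive application described above.

```python
from typing import List, Optional, Tuple


def choose_enrichment_arg(args: List[Tuple[bool, bool]], limit: int = 1) -> Optional[int]:
    """Pick the argument to enrich, as a 1-based argument number, or None.

    args[i] is (is_good, known_nonvar) for argument i+1; limit is the
    selection limit S of select_limit(S).
    """
    good_nonvar = sum(1 for good, nonvar in args if good and nonvar)
    if good_nonvar < limit:
        # Lowest numbered good argument that is not known to be nonvariable.
        for number, (good, nonvar) in enumerate(args, start=1):
            if good and not nonvar:
                return number
        return None
    # Otherwise consider only the first argument.
    if args:
        good, nonvar = args[0]
        if good and not nonvar:
            return 1
    return None
```

With the default `limit=1` the first branch is taken exactly when no good argument is known to be nonvariable, which is the condition of Enrichment Heuristic 2.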

To show how enrichment works, consider the following predicate without type declarations:

    a(a).
    a(b).

It is transformed into:

    a(A) :- var(A), a_v(A).       % If A is unbound.
    a(A) :- nonvar(A), a_n(A).    % If A is nonvariable.

    a_v(a).    a_n(a).
    a_v(b).    a_n(b).

The predicate a/1 has been enriched with an unbound type (in a_v/1) and with a nonvariable type (in
a_n/1). As another example, consider the definition without any type declarations:

    member(X, [X|_]).
    member(X, [_|L]) :- member(X, L).

In this case the heuristic picks the second argument, since the first one does no useful unifications. After
enrichment, the predicate becomes:

    member(X, L) :- var(L), member_v(X, L).
    member(X, L) :- nonvar(L), member_n(X, L).

    member_v(...) :- (same as original definition)
    member_n(...) :- (same as original definition)

The two tests var(L) and nonvar(L) determine which of the two dummy predicates to execute,
member_v/2 or member_n/2, and are compiled into a single conditional branch. This is a
consequence of the fact that the two tests are mutually exclusive, i.e. if one succeeds then the other fails
and vice versa. Both member_v/2 and member_n/2 have the same definition as the original
predicate, but they have different types for the second argument. The predicate member_v/2 is compiled
assuming the second argument is a variable. The predicate member_n/2 is compiled assuming the
second argument is a nonvariable. Both member_v/2 and member_n/2 are also targets of the
factoring transformation (section 4).

Type enrichment can introduce a significant increase in code size if it is not handled carefully. In
practice, the code size is kept small because: (1) the added types result in significantly smaller code for
clause selection in each of the two dummy predicates, (2) before doing enrichment, head-body
segmentation separates clause heads from the bodies, so that long clause bodies are not duplicated, and (3) the BAM
transformation stage (Chapter 6) removes any remaining duplicate code. In a sense, the definitions are first
"loosened up" by head-body segmentation and type enrichment to allow more optimization, and then later
"tightened up."

6.3. Goal reordering

This transformation reorders goals in a clause to increase determinism and to reduce the number of
superfluous unifications that are done. Goals that are useful in determinism extraction are put as early as
possible, and goals that are certain to succeed (such as unifications with uninitialized variables) are put
later.

The goals in a clause are classified in four categories: tests (Table 4.11), unifications with unbound
variables, unifications with uninitialized variables, and other goals. The goals are reordered so that tests
are first (for deterministic selection), followed by unifications with unbound variables (may be affected by
aliasing), unifications with uninitialized variables (unaffected by aliasing, so they can safely be put last),
and the other goals. The reordering takes into account the fact that unification is commutative, i.e. that
unification goals can be permuted in any way without changing the semantics. Some reorderings are better
than others because aliasing can worsen the type formula, e.g. if X is unbound (var(X)) then after
performing the unification Y=Z it may not be unbound any more, if it is aliased to Y or Z. The reordering is
constrained so that aliasing does not change the operational semantics.

For example, consider a predicate that has an uninitialized argument:

    :- mode((a(A, B, C) :- uninit(C))).
    a(X, Y, Z) :- Z=[X|L], X<Y, ...

The transformation knows that the unification Z=[X|L] does not instantiate X or L because Z is unbound
and unaliased. Therefore the unification is moved back:

    a(X, Y, Z) :- X<Y, Z=[X|L], ...

This has two advantages: (1) the test X<Y is brought forward so that it can be used by determinism
extraction, and (2) the unification Z=[X|L] is not done if the test X<Y fails.
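
The category ordering can be sketched as a stable sort. This Python fragment is a deliberate simplification: the category labels and the function `reorder` are hypothetical, the goals are assumed to be pre-classified, and the aliasing constraint that the compiler enforces on the reordering is not modelled.

```python
from typing import List, Tuple

# Assumed category names, in the order in which the goals should appear.
ORDER = {"test": 0, "unify_unbound": 1, "unify_uninit": 2, "other": 3}


def reorder(goals: List[Tuple[str, str]]) -> List[str]:
    """Reorder (goal text, category) pairs by category.

    sorted() is stable, so goals within one category keep their original order.
    """
    ranked = sorted(goals, key=lambda goal: ORDER[goal[1]])
    return [text for text, _ in ranked]
```

For the clause above, classifying Z=[X|L] as a unification with an uninitialized variable and X<Y as a test moves the test in front of the unification.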

This transformation compensates for the popular programming style which puts all unifications in the
head and all tests in the body, e.g. people prefer to write:

    a([X|L], [X|M]) :- var(X), ...

instead of:

    a([X|L], Z) :- var(X), Z=[X|M], ...

The first version does not imply anything about the instantiation pattern of the arguments, whereas the
transformed version does.

6.4.  Determinism  extraction  with  test  sets 

The majority of predicates written by human programmers are intended to be executed in a
deterministic way. These predicates are in effect case statements, yet they are too often compiled in an
inefficient manner, by means of shallow backtracking (i.e. saving the machine state, unification with the
clause heads, and repeated failure and state restoration). This section describes the general technique used
in the compiler to convert shallow backtracking into conditional branching.

    Test set                                         Instruction
    { A<B, A≥B }                                     branch if less than
    { var(A), atomic(A), cons(A), structure(A) }     four-way branch on type
    { A=a, A=b, A=c, A=d, A=e, A=f, A=g }            look up in hash table

Figure 4.10 - Some examples of test sets


6.4.1.  Definitions 


Predicates are compiled into code which is as deterministic as possible through the concept of the
test set. Two definitions are useful:

Definition ST: A goal G is a simple test with respect to the kernel Prolog predicate P and the
formula F if it satisfies the following conditions:

•  G uses only variables that occur in the head of P.

•  The implementation of G does not change any state in the execution model, i.e. G
does not cause side-effects (I/O or database operations), G does not create choice points,
and G does not bind any variables.

•  G does not always succeed.

Definition TS: A set of goals is a test set with respect to the kernel Prolog predicate P and the
type formula F if it satisfies the following conditions:

•  Each goal in the set is a simple test according to definition ST.

•  With a given set of variable values, at most one goal in the set can succeed.

•  A multi-way branch in which each destination corresponds to the success of one of the
goals in the set can be implemented in the target architecture.

The tests in the set need not actually be present in the definition of P. Whether or not a given set of goals is
a test set depends on the architecture and the predicate P.

6.4.2. Some examples

Most conditional branches in an architecture correspond to a test set. For example, a
branch-if-less-than instruction corresponds to the test set {A<B, A≥B}. More complex conditions such as an n-way
branch implemented by hashing can also be represented as test sets. Figure 4.10 shows some examples of
test sets. The second and third examples correspond to WAM instructions.

To illustrate the use of test sets, consider the predicate:

    max(A, B, C) :- A≤B, C=B.
    max(A, B, C) :- A≥B, C=A.

which is one way to calculate the maximum of A and B. It is compiled as:

    max(A, B, C) :- if A>B then C=A
                    else if A<B then C=B
                    else (C=B or C=A)

(The Prolog notation is simplified for readability.) The predicate is executed completely deterministically if
A>B or A<B; a choice point is created only when A=B. The choice point maintains the operational
semantics: since both clauses of the original predicate succeed when A=B, there are two identical
solutions.
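
The behavior of the compiled predicate can be mimicked in Python by returning the list of solutions for C. This is only a model of the control flow described above, with a hypothetical function name `max_solutions`; it is not BAM code.

```python
from typing import List


def max_solutions(a: int, b: int) -> List[int]:
    """Model of the compiled max/3: the solutions for C, in order."""
    if a > b:
        return [a]       # deterministic: only the second clause can succeed
    elif a < b:
        return [b]       # deterministic: only the first clause can succeed
    else:
        return [b, a]    # a == b: both clauses succeed, giving two identical solutions
```

Only the final branch corresponds to creating a choice point; the first two branches return a single solution without any backtracking state.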


type testset = testset(testset_name, testset_ident, set of goal);

function determinism(D : disjunction; H : goal; F : formula; Previous : set of testset) : disjunction;
var TS : testset;
    TSset : set of testset;
begin
    if length(D) ≤ 1 then return D;
    TSset := find_testsets(D, H, F, Previous);
    if TSset = ∅ then return D;
    TS := pick_testset(TSset);
    return code_testset(TS, D, H, F, Previous)
end;


Figure 4.11 - The determinism extraction algorithm


function find_testsets(D : disjunction; H : goal; F : formula; Previous : set of testset) : set of testset;
var TS : testset;
    TSset : set of testset;
    i, j : integer;
begin
    /* At this point D = (C1 ; ... ; Cn) where D has n choices */
    TSset := ∅;
    for i := 1 to n do begin
        /* Ci = (Gi1, ..., Gini) where Ci has ni goals */
        for j := 1 to ni do
            if Gij = [illegible] then exit inner loop
            else for all testsets TS from table do begin
                /* TS = testset(Name, Ident, Tests) from Table 4.11 */
                if TS ∉ Previous and vars(Gij) ⊆ vars(H) and bindset(Gij, F) = ∅ then
                    if ∃ T ∈ Tests : (Gij implies T and not(F implies T)) then
                        TSset := TSset ∪ {TS}
            end
    end;
    return TSset
end;


Figure 4.12 - Finding all test sets in a predicate


function pick_testset(TSset : set of testset) : testset;
var TS : testset;
begin
    pick TS ∈ TSset such that
        ∀ U ∈ TSset : goodness(TS) ≥ goodness(U);    /* From Equation (G) */
    return TS
end;


Figure 4.13 - Picking the best test set


6.4.3. The algorithm

Given a predicate, the compiler proceeds by first finding all test sets that contain tests that are
implied by goals in the predicate. This depends on the type formula that is known for the predicate; for
example, the unification X=a is only a test if X is nonvariable, i.e. if the type formula implies
nonvar(X). Then a "goodness" measure is calculated for each test set, and the test set with the largest
goodness is used first. The goodness measure is calculated heuristically; in the current implementation
each test set is weighted by an architecture-dependent goodness (which depends on how efficiently it is
implemented in the architecture) and by the number of possible outcomes (e.g. hashing with a large number
of cases is considered better than a two-way branch). The predicate is converted into a case statement
using the best test set. The algorithm is called recursively for each arm of the case statement to build a
decision tree. This tree is collapsed into a graph by the BAM transformation stage.

function code_testset(TS : testset; D : disjunction; H : goal; F : formula; Previous : set of testset) : disjunction;
var T : goal;
    Choices : disjunction;
begin
    Choices := ( );
    /* At this point TS = testset(Name, Ident, Tests) */
    for all T ∈ Tests do begin
        Dtest := subsume(T, D);
        Dtest := determinism(Dtest, H, update_formula(T, F), Previous ∪ {TS});
        append '$test'(T, Dtest) to Choices;
        D := subsume(not(T), D)
    end;
    D := determinism(D, H, F, Previous ∪ {TS});
    append '$else'(D) to Choices;
    return '$case'(Name, Ident, Choices)
end;

Figure 4.14 - Converting a disjunction into a case statement

Figures 4.11 through 4.14 give a pseudocode definition of this algorithm. The figures define the
function determinism(D, H, F, Previous) that performs the determinism extraction. Given a predicate
written as a head H and a disjunction D, along with the type formula F that is true for that predicate, the
function finds as many test sets as possible in the disjunction and converts them into case statements. It
returns a new disjunction that contains these case statements. The parameter Previous is used to avoid
infinite recursion. It contains all test sets that have already been used, to make sure each test set is only
used once.

The function find_testsets(D, H, F, Previous) returns a list of all test sets in the disjunction (Figure
4.12). It picks a test set if there is a goal in the predicate which implies a test in the test set. It limits the
goals to those that do not bind any variables (bindset(Gij, F) = ∅) and those that use only variables that
occur in the head (vars(Gij) ⊆ vars(H)). The function pick_testset(TSset) returns the test set with the
greatest measure of goodness, as given by Equation (G) (Figure 4.13). The function code_testset(TS,
D, H, F, Previous) converts the disjunction D into a case statement when given a test set TS (Figure 4.14).
It uses the functions subsume(F, F1) and update_formula(F, F1), which are defined in section 3.
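
The overall shape of the algorithm in Figures 4.11 through 4.14 can be sketched in Python under strong simplifying assumptions. Goals are plain strings, a goal "implies" a test only if it is textually identical to it, the head H and the type formula F are omitted, and `subsume` is reduced to dropping clauses that contain an excluded test. The tables `TEST_SETS` and `RAW_GOODNESS` and their values are hypothetical.

```python
from typing import Dict, FrozenSet, List

Clause = List[str]
Disjunction = List[Clause]

# Hypothetical test-set table: name -> mutually exclusive tests.
TEST_SETS: Dict[str, List[str]] = {
    "type(A)": ["var(A)", "atom(A)", "cons(A)"],
    "less(A,B)": ["A<B", "A>=B"],
}
# Hypothetical raw goodness values; the real ones depend on the architecture.
RAW_GOODNESS: Dict[str, int] = {"type(A)": 131, "less(A,B)": 120}


def directions(name: str, d: Disjunction) -> int:
    """Number of tests of the test set that occur in the disjunction."""
    return sum(1 for t in TEST_SETS[name] if any(t in clause for clause in d))


def goodness(name: str, d: Disjunction) -> int:
    return 1000 * directions(name, d) + RAW_GOODNESS[name]  # Equation (G)


def find_testsets(d: Disjunction, previous: FrozenSet[str]) -> List[str]:
    return [n for n in TEST_SETS if n not in previous and directions(n, d) > 0]


def subsume(test: str, holds: bool, tests: List[str], d: Disjunction) -> Disjunction:
    """Simplify d assuming `test` succeeds (holds=True) or fails (holds=False)."""
    out: Disjunction = []
    for clause in d:
        if holds:
            if any(g in tests and g != test for g in clause):
                continue                  # an exclusive test cannot also succeed
            out.append([g for g in clause if g != test])
        elif test not in clause:
            out.append(list(clause))      # clauses that need `test` are dropped
    return out


def determinism(d, previous=frozenset()):
    if len(d) <= 1:
        return d
    found = find_testsets(d, previous)
    if not found:
        return d
    best = max(found, key=lambda n: goodness(n, d))
    return code_testset(best, d, previous)


def code_testset(name, d, previous):
    tests, used, choices = TEST_SETS[name], previous | {name}, []
    for t in tests:
        choices.append(("$test", t, determinism(subsume(t, True, tests, d), used)))
        d = subsume(t, False, tests, d)
    choices.append(("$else", determinism(d, used)))
    return ("$case", name, choices)
```

For a three-clause predicate that starts each clause with var(A), atom(A), or cons(A), the sketch produces one case node with three test arms and an empty else arm, so no clause is tried by backtracking.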


Table 4.11 - Test sets

Name                               Example test              Example BAM translation
equal                              X==Y                      equal(X, Y, Lbl)
  (X or Y is simple at run-time)
equal(atomic, A)                   X==A                      equal(X, A, Lbl)
  (A is an atom)
equal(structure, F/N)              '$name_arity'(X, F, N)    equal([X], F/N, Lbl)
  (F/N is name/arity)
hash(atomic)                       X=A (A is atomic)         hash(tatm, X, N, Lbl)
hash(structure)                    X=S (S is a structure)    hash(tstr, X, N, Lbl)
comparison(Class, Kind)            X<Y                       jump(lts, X, Y, Lbl)
  (Class ∈ {eq, lts, gts})
  (Kind ∈ {arith, unify, stand})
Type                               var(X)                    test(eq, tvar, X, Lbl)
  (Type ∈ AllTypes)
switch(Type)                       atom(X)                   switch(tatm, X, L1, L2, L3)
  (Type ∈ TagTypes - {var})

Table 4.11 lists the test sets currently recognized by the compiler. This includes unification goals, all
type checking predicates, and all arithmetic comparisons. For each test set it gives the name, a
representative test in the test set (only one is given, although usually there are several others), and the translation of
that test into a conditional branch of the BAM instruction set. For the test sets hash(atomic) and
hash(structure) the BAM code includes a hash table (not shown) in addition to the hash instruction. The
following definitions simplify the table:

TagTypes = {var, atom, structure, cons, negative, nonnegative, float}, i.e. all types that
correspond to one tag in the VLSI-BAM architecture.

AllTypes = TagTypes ∪ {atomic, integer, simple, compound}, i.e. it includes types that
correspond to more than one tag.

The goodness measure for a test set in a predicate is calculated using the following rule:

    Goodness = 1000·D + G

where D is the number of directions of the test set that occur in the predicate and G is the raw goodness
measure of the test set. This rule ensures that the number of useful directions in the test set is most
important. The raw goodness is used only when the number of directions is the same. Table 4.12 gives the raw
goodness of all test sets in the VLSI-BAM architecture [34], with a brief justification of the ranking.

Table 4.12 - Raw goodness measure of test sets in the VLSI-BAM

Test set             Rank  Comments
-------------------  ----  ------------------------------------------------------------------
switch(cons)         131   Switch is best because it is fast and it is a three-way branch, so
switch(structure)          it gives the most information. Switch of compound terms is better
                           than other switches because it makes traversing a recursive term
                           (like a list or a tree) fast.
switch(negative)     130   Switch of atomic terms is worse because it penalizes the case of
switch(nonnegative)        traversing a recursive term.
switch(atom)
switch(integer)      129   Switch of integer is worse because the VLSI-BAM has separate
                           negative and nonnegative (tpos and tneg) tags, requiring two
                           branches.
var                  120   These test sets are types that correspond directly to tags, and
atom                       there exist fast two-way branches on tags.
cons
structure
negative
nonnegative
equal                85    This test set requires two instructions, a compare and branch, and
                           also possibly loading its arguments into registers.
equal(atomic,_)      80    These test sets each require two instructions, a compare and
comparison(_,_)            branch.
integer              79    These test sets are types that each correspond to two tags, so
atomic                     they need two tag checks.
compound
equal(structure,_)   60    Equality comparison of a structure's functor and arity needs a
                           memory reference.
simple               50    This test set corresponds to a type that needs five tag checks
                           (four without floating point).
hash(atomic)         41    Hashing is the slowest because it needs to calculate the hash
                           address.
hash(structure)      40    Hashing on a structure is slightly slower than hashing on an
                           atomic term because a memory load is needed to access the main
                           functor of the structure, whereas the atomic term is directly
                           available in the register.

The value of the rank is not important; only the relative order is important. Architectures rank the test sets
according to how efficiently they are implemented in the architecture. To compile for a different architecture,
only the ranking is changed in the compiler. The ranking is modified for other processors by a compiler
option. For example, for the MIPS processor, the option mips changes the ranking to make the test
set equal(atomic,[]) best, i.e. a comparison with the atom [] (nil), because it can be implemented
with a single-cycle conditional branch instruction. The MIPS does not have separate tags for negative and
nonnegative integers, so the test sets negative and nonnegative are not implemented as efficiently as on the
VLSI-BAM. These two test sets have lower ranks.
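
The goodness rule can be illustrated with a short Python sketch. This is not the compiler's code: the function names goodness and pick_testset_by_goodness, the dictionary layout, and the choice of Python are assumptions made for illustration, and only a subset of the ranks of Table 4.12 is copied in.

```python
# Raw goodness ranks for a few test sets, copied from Table 4.12 (VLSI-BAM).
RAW_GOODNESS = {
    "switch(cons)": 131,
    "switch(structure)": 131,
    "switch(atom)": 130,
    "switch(integer)": 129,
    "var": 120,
    "equal": 85,
    "equal(atomic,_)": 80,
    "simple": 50,
    "hash(atomic)": 41,
    "hash(structure)": 40,
}


def goodness(test_set, directions):
    """Goodness = 1000*D + G: D dominates, G only breaks ties."""
    return 1000 * directions + RAW_GOODNESS[test_set]


def pick_testset_by_goodness(candidates):
    """candidates maps a test-set name to D, the number of its directions
    that occur in the predicate. Returns the name with the greatest goodness."""
    return max(candidates, key=lambda ts: goodness(ts, candidates[ts]))


if __name__ == "__main__":
    # Three useful directions beat two, even for the lowest-ranked test set.
    print(pick_testset_by_goodness({"hash(atomic)": 3, "switch(cons)": 2}))
    # With equal D the raw goodness decides.
    print(pick_testset_by_goodness({"equal": 2, "var": 2}))
```

Because every raw rank is below 1000, a test set with more useful directions always wins; retargeting to another processor would only mean swapping the RAW_GOODNESS table.
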
Chapter 5

Compiling  Kernel  Prolog  to  BAM  Code 

1.  Introduction 

The previous chapters described the conversion of standard Prolog to kernel Prolog and the optimizing
kernel transformations. This chapter shows how the optimized kernel Prolog is compiled to BAM
code. The compilation to BAM is performed in two steps for each predicate. In the first step, the control
instructions that make up the framework of the predicate are compiled by the predicate compiler. This
includes compiling the deterministic case statements into conditional branches and the disjunctions into
choice point instructions.

In the second step, the clauses that make up the body of the predicate are compiled by the clause
compiler. The clause compiler uses two primitives, the goal compiler and the unification compiler, to compile
goals and explicit unifications. The clause compiler also does register allocation, entry specialization
(replacing built-in predicates by faster entry points), the write-once transformation (for fast
trailing), and the dereference chain transformation (to maintain consistency with the dataflow analysis).
These transformations are explained in detail in the sections below.

2.  The  predicate  compiler 

In the kernel transformation stage (Chapter 4), determinism extraction attempts to convert each
predicate into a series of nested case statements. This is not always successful; sometimes the case
statements still retain disjunctions (OR choices) that could not be converted into deterministic code. The
predicate compiler compiles both the case statements and the disjunctions into BAM code. The case statements
are compiled into conditional branches. The disjunctions are compiled into choice point instructions. The
predicate compiler uses two primitives, the determinism compiler and the disjunction compiler, to compile
the predicate's case statements and disjunctions.

2.1. The determinism compiler

Compiling a kernel Prolog predicate into deterministic BAM code is done in two steps. First, the
determinism transformation (a kernel Prolog transformation, Chapter 4) converts a kernel Prolog predicate
into a series of nested case statements. Then the determinism compiler compiles the nested case statements
into BAM code. A case statement may contain any test set, and each test set is mapped to a conditional
branch. The test sets and their corresponding conditional branches are given in Table 4.11.

2.2.  The  disjunction  compiler 

A disjunction (an OR formula) is a list of clauses that encapsulates a choice. The first clause is executed
the first time the disjunction is encountered. The remaining clauses are executed in order on
backtracking: each time backtracking returns to the disjunction the next clause is tried. This is implemented
by code which generates choice points. A choice point encapsulates the state of the abstract
machine at the time it is created. Backtracking restores machine state from a choice point to let execution
continue from the point at which the choice point was created.

Creating and restoring machine state in choice points is time-consuming. To minimize the size of the
choice points (and hence the time required to create them), the choice point management instructions in the
BAM are streamlined to perform the least amount of data movement. They save only those registers that
are needed in the clauses of the disjunction after the first, and for each clause of the disjunction they restore
only those registers that are needed in that clause. Argument registers are restored in the clause itself and
not in the fail instruction. Therefore the size of the choice point does not have to be stored in the choice
point and decoded in the fail instruction. A disadvantage is a slightly larger code size.† Consider the
following kernel Prolog for a predicate P with n clauses:

    Head :- ( C1 ; C2 ; ... ; Cn ; fail ).

A single choice point is created for each invocation of P. The set of registers saved in the choice point is
the set of all head arguments that are used in clauses after the first, i.e. C2 through Cn. Arguments that
occur only in clause C1 do not have to be stored in the choice point. The set of registers that is restored for
each clause is the set of arguments used in that clause.

† This is less of a problem in the VLSI-BAM since the instruction reorderer merges pairs of single-word loads into
double-word loads.

Before creating the choice point, the compiler dereferences those arguments that it can deduce will
be dereferenced later. This avoids dereferencing the same argument more than once. The set of arguments
to be dereferenced is derived by checking the type formula corresponding to each goal in the body of the
predicate's definition, and noting whether its arguments have to be dereferenced. For example, arithmetic
operations and relational tests are goals that require their arguments to be dereferenced.

To illustrate the compilation scheme, consider the following predicate:

    p(A,B,C,D) :- ( a(A)
                  ; c(C)
                  ; d(D)
                  ; fail
                  ).

It is compiled as:

    procedure(p/4).
    choice(1/3,[2,3],l(p/4,2)).     ; Save registers r(2) and r(3).
    jump(a/1).
    label(l(p/4,2)).
    choice(2/3,[2,no],l(p/4,3)).    ; Restore only register r(2).
    move(r(2),r(0)).
    jump(c/1).
    label(l(p/4,3)).
    choice(3/3,[no,3],fail).        ; Restore only register r(3).
    move(r(3),r(0)).
    jump(d/1).

The choice instructions do all the choice point manipulation: choice(1/3,...) creates the choice
point, choice(2/3,...) modifies the address to return to on backtracking, and
choice(3/3,...) removes the choice point. Register r(0) is not saved in the choice point because
it is not needed in clauses beyond the first. The second and third clauses restore only the registers they
need. Register r(1) is not saved because it is not needed at all.

Each choice instruction contains a list of the registers that it uses. The length of the list is the same
for all choice instructions in a predicate. For choices after the first, the atom no is put in the positions of
registers that do not have to be restored. For example, the list [0,no,5] means that registers r(0)
and r(5) are restored from the first and third locations in the choice point, and the second location is not
accessed.

In this example a further optimization can be done by merging the move instructions with the choice
instructions, i.e.:

    choice(3/3,[no,3],fail).
    move(r(3),r(0)).

becomes:

    choice(3/3,[no,0],fail).

This is possible because the value loaded in a register is determined by its position in the list, not by its
number, and because register r(3) is only used to load r(0).
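
The way the register lists of the choice instructions follow from the clauses can be sketched in a few lines of Python. This is an illustrative reading of the scheme described above, not the compiler's implementation; the function name choice_register_lists and the representation of each clause as a set of head-argument register numbers are assumptions. The move-merging optimization is not modeled.

```python
def choice_register_lists(clause_regs):
    """clause_regs holds, for each clause of the disjunction, the set of
    head-argument register numbers that the clause uses.
    Returns one register list per choice instruction."""
    # Only registers needed by clauses after the first are saved.
    saved = sorted(set().union(*clause_regs[1:]))
    lists = [list(saved)]  # choice(1/n,...) saves all of them
    for regs in clause_regs[1:]:
        # Later choices restore only what their clause uses; "no" marks
        # the choice point locations that are not accessed.
        lists.append([r if r in regs else "no" for r in saved])
    return lists


if __name__ == "__main__":
    # p(A,B,C,D) :- ( a(A) ; c(C) ; d(D) ; fail ).
    # A is in r(0), C in r(2), D in r(3); B in r(1) is never used.
    for n, regs in enumerate(choice_register_lists([{0}, {2}, {3}]), start=1):
        print(f"choice({n}/3, {regs}, ...)")
```

For the predicate p/4 above this yields the lists [2,3], [2,no] and [no,3], matching the three choice instructions in the compiled code.
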


Figure 5.1 - Structure of the clause compiler


3. The clause compiler


The clause compiler converts a clause from kernel Prolog form (with type annotations) to BAM
code. The structure of the clause compiler is given in Figure 5.1. After compiling the goals in the body
there are two intermediate results: (1) BAM code in which variables have not yet been allocated to registers
(skeleton code) and (2) a variable occurrence list (the varlist), that contains all unallocated variables in the
skeleton code. The final BAM code is obtained by passing the varlist to the register allocator.

Each goal in the clause body is compiled in four steps. First, three transformations are performed on
the goal: entry specialization, the write-once transformation, and the dereference chain transformation.
Then the goal is compiled into BAM code by one of two routines, the unification compiler or the goal
compiler, depending on whether the goal is a unification or not.

These are the important blocks in the clause compiler:

(1)  The goal compiler. Its main task is to handle argument passing. Because of the interaction between
the different kinds of unbound variables, initialized and uninitialized, this results in a case analysis.
In addition, the goal compiler compiles in-line some built-in predicates and the dummy predicates
that were created in the transformation to kernel Prolog.

(2)  The unification compiler. Its task is, given a type, to compile an explicit unification into the
simplest possible code.

(3)  The register allocator. Its task is to allocate variables to registers in such a way that the number of
superfluous move instructions is minimized. It uses a data structure called the varlist which is
generated by the clause body compiler.

(4)  Entry specialization. This attempts to replace each goal in the clause by a faster entry point,
depending on the types known at the call.

(5)  Write-once transformation. This transformation is part of a technique for reducing the overhead of
trailing.

(6)  Dereference chain transformation. This transformation is necessary to keep the dataflow analysis
and the clause compiler consistent.

The following sections give more details about each of these blocks. First, an example of a clause
compilation is given, with emphasis on the skeleton code, the varlist, and a specification of the register
allocator. This is followed by discussions of the goal compiler, the unification compiler, entry
specialization, the write-once transformation, and the dereference chain transformation.


3.1.  Overview  of  clause  compilation  and  register  allocation 


This  section  gives  an  example  of  how  a  clause  is  compiled.  Consider  the  following  clause  with  no  types: 


    a(A,B) :- b(A,C), d(C,B).


Compilation of this clause proceeds in three steps. First, the kernel Prolog is compiled to BAM code and a
variable  occurrence  list,  or  varlist.  In  this  example,  most  of  the  work  in  this  step  is  done  in  the  goal  com¬ 
piler.  The  resulting  BAM  code  is  referred  to  as  skeleton  code  since  variables  have  not  yet  been  allocated  to 
registers.  The  varlist  is  derived  from  the  skeleton  code  and  contains  the  list  of  variables  and  registers  in  it. 
Second,  the  register  allocator  uses  the  varlist  to  allocate  variables  to  registers.  Third,  after  all  predicates 
and  all  clauses  are  compiled,  the  BAM  optimization  stage  improves  the  code  (Chapter  6).  The  skeleton 
code  for  this  clause  is: 


    allocate(X).                ; Create an environment (its size is still unknown).
    move(r(0),A).               ; Load the head arguments into variables A and B.
    move(r(1),B).
    move(tvar^r(h),C).          ; Create an unbound variable and put it in C and D.
    move(tvar^r(h),D).
    pragma(push(variable)).     ; C may exist beyond a call, D exists between calls.
    push(D,r(h),1).
    move(A,r(0)).               ; Load the parameters of the first call.
    move(D,r(1)).
    call(b/2).

    pragma(tag(C,tvar)).        ; C has an extra link, with a tvar tag.
    move([C],r(0)).             ; Extra indirection to remove the extra link.
    move(B,r(1)).
    call(d/2).                  ; No last call optimization in the skeleton code.
    deallocate(X).
    return.


The  varlist  for  this  clause  is: 


    [pref,r(0),A,               ; Corresponds to move(r(0),A).
     pref,r(1),B,
     C,pref,C,D,D,              ; Corresponds to the unbound variable in C and D.
     pref,A,r(0),
     pref,D,r(1),
     fence,                     ; Corresponds to call(b/2).

     C,r(0),
     pref,B,r(1),
     fence]                     ; Corresponds to call(d/2).


3.1.1. Construction of the varlist

The varlist is constructed to satisfy these conditions:

(1)  The only contents of the varlist are unbound variables, temporary registers, and the atoms fence
and pref.

(2)  The order of variable occurrences is the same in the skeleton code and the varlist.

(3)  The atom fence is inserted as a marker at each point where temporary variables do not survive.
This corresponds to each call(...) instruction in the skeleton code.

(4)  Two variables that are preferably allocated to the same register are preceded by the atom pref and
called a pref pair. A pref pair is created when allocating the variables to the same register allows an
instruction to be removed. For example, the move(A,r(0)) instruction can be removed if the
variable A is allocated to register r(0).

(5)  A variable occurs exactly once in the varlist if and only if it occurs exactly once in the skeleton code.
Such a variable is called a void variable. An instruction containing a void variable may be removed.

(6)  A variable occurs more than once in the varlist if and only if it occurs more than once in the skeleton
code.
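
As a small illustration of condition (4), the following Python sketch extracts the pref pairs from a varlist and reports which of them end up in the same register under a given allocation. The list-of-strings encoding of the varlist and the names pref_pairs and removable_moves are assumptions for this sketch; the compiler itself works on Prolog terms.

```python
# Varlist of the clause  a(A,B) :- b(A,C), d(C,B).  encoded as strings.
VARLIST = ["pref", "r(0)", "A", "pref", "r(1)", "B",
           "C", "pref", "C", "D", "D",
           "pref", "A", "r(0)", "pref", "D", "r(1)", "fence",
           "C", "r(0)", "pref", "B", "r(1)", "fence"]


def pref_pairs(varlist):
    """Return the (x, y) pair that follows each 'pref' atom."""
    return [(varlist[i + 1], varlist[i + 2])
            for i, item in enumerate(varlist[:-2]) if item == "pref"]


def removable_moves(varlist, allocation):
    """Pref pairs whose two members share a register, so that the move
    between them becomes superfluous."""
    def location(item):
        # Registers stand for themselves; variables are looked up.
        return allocation.get(item, item)
    return [(x, y) for x, y in pref_pairs(varlist)
            if location(x) == location(y)]


if __name__ == "__main__":
    allocation = {"A": "r(0)", "B": "p(0)", "C": "p(1)", "D": "r(1)"}
    print(removable_moves(VARLIST, allocation))
```

With A in r(0) and D in r(1), the pairs for move(r(0),A), move(A,r(0)) and move(D,r(1)) are satisfied, so those three moves are candidates for removal.
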


3.1.2. The register allocator

The register allocator assigns a register to each variable in the varlist such that there are no conflicts,
i.e. a single register never holds two values at the same time. The allocator also calculates the size of the
environment (the number of permanent registers) for the allocate and deallocate instructions.
The algorithm is defined in Figure 5.2. It assumes that variables are represented as logical variables, i.e.
that allocating a variable to a register binds that variable in all sets that contain it. It assumes that there are
an infinite number of temporary and permanent registers.

    procedure register_allocator(VL : varlist);
    var V_void, V_perm, V_temp, V_pref : set of variable;
    begin
        V_void := { variable Y | Y occurs exactly once in VL };
        ∀ X ∈ V_void do Allocate each X to r(void);
        V_perm := { variable Y | The sequence [Y, ..., fence, ..., Y] occurs in VL };
        ∀ X ∈ V_perm do Allocate each X to a different p(I);
        Environment size := number of elements in V_perm;
        V_temp := { variable Y | Y occurs more than once in VL };
        V_pref := prefer(VL);
        while V_temp ≠ ∅ do begin
            while ∃ X ∈ V_pref : X is allocatable to r(I) without conflict do begin
                Allocate X to its preferred register r(I);
                V_pref := V_pref − {X};
                V_temp := V_temp − {X};
                V_pref := prefer(VL)
            end;
            if ∃ X ∈ V_temp then begin
                Allocate X to the lowest r(I) possible without conflict;
                V_pref := V_pref − {X};
                V_temp := V_temp − {X};
                V_pref := prefer(VL)
            end
        end
    end;

    function prefer(VL : varlist) : set of variable;
    begin
        return { variable Y | The sequence [pref, Y, _] or [pref, _, Y] occurs in VL }
    end;

Figure 5.2 - The register allocator

The algorithm uses the following correspondence between variable lifetimes and registers:

(1)  A variable that occurs exactly once is allocated to r(void).

(2)  A variable occurring on both sides of a fence marker (it crosses a fence) is allocated to a
permanent register p(I) (a location in the environment).

(3)  A variable that does not cross a fence and that occurs more than once is allocated to a temporary
register r(I).

The algorithm is independent of the write-once transformation and the dereference chain transformation.
This is possible because the clause compiler is careful to feed the allocator a varlist that takes the two
transformations into account.

In the example of the previous section, the allocator assigns the following values to the variables:


    A = r(0)
    B = p(0)
    C = p(1)
    D = r(1)
    X = 2


Since both B and C cross a fence, they are allocated to permanent registers. Both A and D are allocated to
their preferred registers. The number of permanent variables, X, is 2.
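
The lifetime rule that separates void, permanent and temporary variables can be checked with a brief Python sketch. It covers only the classification step, not the preference-driven assignment of register numbers; the string encoding of the varlist, the helper is_variable and the function classify are assumptions made for this illustration.

```python
# Varlist of the clause  a(A,B) :- b(A,C), d(C,B).  encoded as strings.
VARLIST = ["pref", "r(0)", "A", "pref", "r(1)", "B",
           "C", "pref", "C", "D", "D",
           "pref", "A", "r(0)", "pref", "D", "r(1)", "fence",
           "C", "r(0)", "pref", "B", "r(1)", "fence"]


def is_variable(item):
    # Assumption of this sketch: variables start with an uppercase letter,
    # while registers ("r(0)") and the atoms pref and fence do not.
    return item[0].isupper()


def classify(varlist):
    """Map each variable to 'void', 'permanent' or 'temporary'."""
    segments = {}  # variable -> fence-delimited segments in which it occurs
    counts = {}    # variable -> number of occurrences
    segment = 0
    for item in varlist:
        if item == "fence":
            segment += 1
        elif is_variable(item):
            segments.setdefault(item, set()).add(segment)
            counts[item] = counts.get(item, 0) + 1
    kinds = {}
    for var, count in counts.items():
        if count == 1:
            kinds[var] = "void"          # allocated to r(void)
        elif len(segments[var]) > 1:
            kinds[var] = "permanent"     # crosses a fence: p(I)
        else:
            kinds[var] = "temporary"     # r(I)
    return kinds


if __name__ == "__main__":
    kinds = classify(VARLIST)
    print(kinds)
    print("environment size:", sum(k == "permanent" for k in kinds.values()))
```

For the example clause, B and C come out permanent and A and D temporary, which gives the environment size of 2 stated above.
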


3.1.3. The final result


The final BAM code output by the compiler after all transformations and optimizations (including the
BAM transformations of Chapter 6) is:

    allocate(2).                ; Allocate space for two permanent variables.
    move(r(1),p(0)).
    move(tvar^r(h),r(1)).       ; Create an unbound variable and put it in r(1) and p(1).
    move(r(1),p(1)).
    pragma(push(variable)).
    push(r(1),r(h),1).
    call(b/2).
    pragma(tag(p(1),tvar)).
    move([p(1)],r(0)).          ; Indirection due to dereference chain transformation.
    move(p(0),r(1)).
    deallocate(2).
    jump(d/2).                  ; Last call optimization converts 'call' to 'jump'.


3.2.  The  goal  compiler 

Given a goal and type information about the goal, this module sets up the arguments to call the goal,
does the call, and sets up the return arguments. The main task of the goal compiler is to handle the
complexities that arise when supporting combinations of uninitialized and initialized parameters. The
following situations are also handled:

(1)  Duplicate variables. An uninitialized variable that occurs twice in a goal must be initialized before
calling the goal.

(2)  Uninitialized register variables. Passing arguments as uninitialized register variables requires
some care. These variables are not passed into a predicate, but are outputs returned in registers.

(3)  Dummy predicates. Several compiler transformations create new predicates as part of the
transformation. These predicates are only called once, so they are compiled in-line.

(4)  Built-in predicates. Some built-in predicates are translated into in-line code (Table 5.5).


    function compile_goal(G : goal; F : formula; V_sf : set) : return (Code : list; F_out : formula; V_sf,out : set);
    var V_uninit, V_init : set of variable;
        Initcode, Precode, Call, Postcode : list of instruction;
        A : term;
        g_i, r_i : (ini, mem, reg);
        i : integer;
    begin
        /* Initialize all uninitialized variables that are duplicated */
        V_uninit := { X | F implies (uninit_mem(X) or uninit_reg(X)) };
        V_init := ((vars(G) − V_sf) ∪ V_uninit) ∩ dups(G);      /* Table 4.6 */
        Initcode := list of (∀ X ∈ V_init : Code to initialize the variable X);

        /* Pass arguments to the goal and clean up afterwards */
        Precode := [];
        Postcode := [];
        for i := 1 to arity(G) do begin
            A := (argument i of goal G);
            g_i := given_flag(A, F, V_sf);                      /* Table 5.1 */
            r_i := require_flag(A, G);                          /* Table 5.2 */
            Append precode[g_i, r_i] to Precode;                /* Table 5.3 */
            Append postcode[g_i, r_i] to Postcode               /* Table 5.4 */
        end;

        /* Call the goal */
        if (G can be expanded in-line) then
            Call := (in-line expansion of G)                    /* Table 5.5 */
        else if (G is a dummy predicate) then
            Call := (in-line compilation of G's definition)
        else if (G does not alter temporary registers) then
            Call := (a simple_call instruction for G)           /* Table 3.7 */
        else
            Call := (a call instruction for G);

        Code := append(Initcode, Precode, Call, Postcode)
    end;

Figure 5.3 - The goal compiler


The function compile_goal(G, F, V_sf) defines the goal compiler (Figure 5.3). Its inputs are the goal (G), a
type formula (F), and the set of variables that have a value on input (V_sf). Its outputs are a list of BAM
instructions (Code), the type formula true on output (F_out), and the set of variables that have a value on
output (V_sf,out).

Each goal has three type formulas associated with it: a Require type, a Before type, and an After type.
These types are optionally given by programmer input and are supplemented by dataflow analysis. The
compiler maintains a table of these types for all predicates including built-ins and internals. The Require
type gives the types that the arguments being passed to the goal must have, i.e. the goal compiler is
required to make them true in all cases. The Before type gives the types that are true before the call. The
After type gives the types that are true after the call returns. No special action is needed by the goal
compiler to ensure the validity of the Before and After types.

Compiling a goal is made more complex because the kind of argument needed by the goal may not
be the same as the one that is given to it. The goal's Given type (which is valid before the goal and given
by F in Figure 5.3) must be reconciled with the goal's Require type. The most common Require and
Given types are the three varieties of unbound variables: uninitialized memory and register variables and
initialized variables. This requires a case analysis with 3x3 cases for each argument of the goal to
properly match the Require and Given types.


Table 5.1 - Calculating the Given flag of an argument

Condition on argument A                                    g_i
---------------------------------------------------------  ----
nonvar(A)                                                  ini
var(A) ∧ (F implies uninit_mem(A))                         mem
var(A) ∧ ((A ∉ V_sf) ∨ (F implies uninit_reg(A)))          reg
var(A) ∧ (A ∈ V_sf)                                        ini

Table 5.2 - Calculating the Require flag of an argument

Condition on argument A                                    r_i
---------------------------------------------------------  ----
require(G) implies uninit_mem(A)                           mem
require(G) implies uninit_reg(A)                           reg
otherwise                                                  ini

Table 5.3 - Calculating the precode from the flags

g_i   r_i   precode[g_i, r_i]
----  ----  ------------------------------------------
reg   reg   []
mem   reg   []
ini   reg   []
reg   mem   [move(tvar^r(h),B), adda(r(h),1,r(h))]
mem   mem   []
ini   mem   [move(tvar^r(h),B), adda(r(h),1,r(h))]
reg   ini   [move(tvar^r(h),B), push(B,r(h),1)]
mem   ini   [move(A,[A]), move(A,B)]
ini   ini   []

Table 5.4 - Calculating the postcode from the flags

g_i   r_i   postcode[g_i, r_i]
----  ----  ------------------------------------------
reg   reg   []
mem   reg   [move(B,[A])]
ini   reg   unify(A,B)
reg   mem   [move(B,A)]
mem   mem   []
ini   mem   unify(A,B)
reg   ini   [move(B,A)]
mem   ini   []
ini   ini   []


Require and Given flags r_i and g_i (with values in {ini, mem, reg}) are associated with each goal
argument for the Require and Given types. Tables 5.1 and 5.2 define how the Given and Require flags are
calculated. The function require(G) in Table 5.2 is a defined predicate in the compiler that returns the
Require type for any goal. It knows all about built-in and internal predicates and the results of dataflow
analysis.

Duplicate arguments (e.g. A in the call p(A, A)) are treated specially. An argument that is duplicate
cannot be uninitialized: it occurs in more than one place, so it is not unaliased any more. The goal
compiler initializes these arguments before doing the case analysis.

Table 5.3 gives the precode, i.e. the code that is generated before the call to set it up, and Table 5.4
gives the postcode, i.e. the code that cleans up after the call. To enforce the Require type, in seven of the
nine cases a different argument B is passed to the call instead of the goal's original argument A. For
example, if the Given flag is mem and the Require flag is reg, then the compiler must create a new
variable B of type uninit_reg(B) to pass to the goal. After the goal returns, the original argument A and the
returned argument B are unified together. The new variable B is created for all combinations of Given and
Require flags except (reg, reg) and (mem, mem). In these two cases no precode or postcode is needed.
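
The 3x3 case analysis amounts to two table lookups per argument, which the following Python sketch makes concrete. The table entries are transcribed from Tables 5.3 and 5.4 as far as they can be read in this copy and should be checked against the original; the dictionary encoding, the instruction strings and the function argument_code are assumptions for illustration, not the compiler's data structures.

```python
# Keys are (Given flag, Require flag). A is the goal's original argument,
# B the replacement argument that is passed to the call.
PRECODE = {
    ("reg", "reg"): [],
    ("mem", "reg"): [],
    ("ini", "reg"): [],
    ("reg", "mem"): ["move(tvar^r(h),B)", "adda(r(h),1,r(h))"],
    ("mem", "mem"): [],
    ("ini", "mem"): ["move(tvar^r(h),B)", "adda(r(h),1,r(h))"],
    ("reg", "ini"): ["move(tvar^r(h),B)", "push(B,r(h),1)"],
    ("mem", "ini"): ["move(A,[A])", "move(A,B)"],
    ("ini", "ini"): [],
}
POSTCODE = {
    ("reg", "reg"): [],
    ("mem", "reg"): ["move(B,[A])"],
    ("ini", "reg"): ["unify(A,B)"],
    ("reg", "mem"): ["move(B,A)"],
    ("mem", "mem"): [],
    ("ini", "mem"): ["unify(A,B)"],
    ("reg", "ini"): ["move(B,A)"],
    ("mem", "ini"): [],
    ("ini", "ini"): [],
}


def argument_code(flags):
    """flags holds one (given, require) pair per goal argument.
    Returns the concatenated precode and postcode lists."""
    pre, post = [], []
    for pair in flags:
        pre += PRECODE[pair]
        post += POSTCODE[pair]
    return pre, post


if __name__ == "__main__":
    # A three-argument goal whose last argument is Given ini, Require reg.
    print(argument_code([("ini", "ini"), ("ini", "ini"), ("ini", "reg")]))
```

In that usage example no precode is needed and the postcode is a single unification of the original argument with the value returned in a register.
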


To simplify the presentation, Figure 5.3 only does part of what the algorithm implemented in the
compiler does. The definition of compile_goal in the figure only handles Require and Given types that are
all uninitialized variables. The actual algorithm handles any types. The type formula F and the variable
set V_sf are updated continuously during the execution of compile_goal. A variable occurrence list is
calculated for the register allocator. The actual algorithm handles 12 cases for parameter passing instead of 9:
as an optimization, two varieties of Given uninitialized register types are recognized.


Table 5.5 - BAM expansion of internal built-ins

Kernel Prolog               BAM instruction
--------------------------  ------------------------------------
'$cut_load'(X)              move(r(b),X)
'$cut'(X)                   cut(X)
'$name_arity'(X,'.',2)      test(ne,tlst,X,fail)
'$name_arity'(X,Na,Ar)      equal([X],tatm^(Na/Ar),fail)
'$name_arity'(X,Na,0)       equal(X,tatm^Na,fail)
'$test'(X,Types)            (a sequence of test instructions)
'$equal'(X,Y)               equal(X,Y,fail)
'$add'(A,B,C)               add(A,B,C)
'$sub'(A,B,C)               sub(A,B,C)
'$mod'(A,B,C)               mod(A,B,C)
'$mul'(A,B,C)               mul(A,B,C)
'$div'(A,B,C)               div(A,B,C)
'$and'(A,B,C)               and(A,B,C)
'$or'(A,B,C)                or(A,B,C)
'$xor'(A,B,C)               xor(A,B,C)
'$sll'(A,B,C)               sll(A,B,C)
'$sra'(A,B,C)               sra(A,B,C)
'$not'(A,C)                 not(A,C)


3.2.1.  An  example  of  goal  compilation 


This  section  gives  a  simple  example  of  compilation  to  show  how  the  goal  compiler  works  in  practice. 
Consider  the  following  predicate  in  standard  Prolog: 

    a(X,Y) :- Y is X+1.

This is converted to kernel Prolog:

    a(X,Y) :- '$add'(X,1,Y).


To compile the call to '$add'/3 it is necessary to pass parameters in the right way. In particular, it is
necessary to pass the output of the addition into variable Y. The built-in '$add'(A,B,C) has the
following types associated with it:

    Require = (deref(A),deref(B),uninit_reg(C)).
    After   = (integer(A),integer(B),integer(C),rderef(A),rderef(B),rderef(C)).

From the Require type, the first two arguments X and 1 of '$add'/3 must be dereferenced and the third
argument Y must be an uninitialized register. The Given types of X and Y depend on the type formula for
a(X,Y). Assume first that no type is given for a(X,Y). From Tables 5.1 and 5.2, the Given flag for Y
is ini and the Require flag for Y is reg. From Tables 5.3 and 5.4, the precode in this case is empty and
the postcode is a call to unify(A,B) to generate unification code. The compiled BAM code is:


    procedure(a/2).
    deref(r(0),r(0)).                  ; Dereference X.
    add(r(0),1,r(0)).                  ; Perform the addition.
    deref(r(1),r(1)).                  ; Dereference Y.
    unify(r(0),r(1),nonvar,?,fail).    ; Unify Y with the result of the addition.
    return.


If a(X,Y) has a type then the code can often be simplified. For example, assume that its type is (deref(X), uninit_mem(Y)), i.e. X is dereferenced and Y is an uninitialized memory variable. Then the Given flag for Y is mem. The compiled BAM code is:

    procedure(a/2).
    add(r(0),1,r(0)).          ; Perform the addition (X is dereferenced).
    pragma(tag(r(1),tvar)).
    move(r(0),[r(1)]).         ; Bind Y to the result of the addition.
    return.


3.3. The unification compiler

This section gives an overview of the compilation of unification, the optimizations that are done, and several examples.

3.3.1. The unification algorithm

Given a unification goal and type information about its arguments, this algorithm generates the simplest possible code to implement the unification. In the general case, the algorithm builds a tree of instructions. Each node of the tree has three branches: one each for read mode and write mode unification, and one for failure. The algorithm generates dereference instructions if necessary and trail instructions to undo variable bindings when backtracking. It does other optimizations including optimal write mode unification, type propagation, and depth limiting.

Write mode unification of a term generates a block of push instructions that builds the term on the heap. Read mode unification of a term is done sequentially for each of the term's arguments. First it checks the name and arity of the term. Then the arguments are unified. For arguments that are simple terms this consists of a single move, equal, or unify instruction. For arguments that are compound terms the unification algorithm is called recursively.

The function unify(X, Y, F, V_sf) defines the unification algorithm (Figures 5.4 and 5.5). Its inputs are the two terms to be unified (X and Y), the type formula true on input (F), and the set of variables that have a value on input (V_sf). Its outputs are a list of BAM instructions (Code), the type formula true on output (F_out), and the set of variables that have a value on output (V_sf,out).

The algorithm does several tasks that are not shown in the figure since they would unnecessarily complicate the presentation. The instruction list, the type formula, and the variable set are updated continuously during the compilation. Before using the value of a variable, it is dereferenced if necessary. Before binding a value to a variable, it is trailed if necessary. A variable occurrence list (varlist) is calculated for the register allocator (Figure 5.2).

3.3.2. Optimizations

The actual implementation does four optimizations not shown in Figures 5.4 and 5.5. It does optimal write mode unification. It keeps track of terms that are ground and recursively dereferenced to avoid compiling superfluous write mode unifications and dereferences. To reduce code size, it performs the last argument optimization and the depth limiting transformation.

    function unify(X, Y : term; F : formula; V_sf : set) return (Code : list; F_out : formula; V_sf,out : set);
    begin
        Code := ( );
        if (var(X) and var(Y)) then begin
            if (F implies (unbound(X) or unbound(Y))) then
                Compile a store instruction
            else
                Compile a call to a general unification subroutine;
            return
        end else if (nonvar(X) and nonvar(Y)) then begin
            Compile a check that X and Y have the same functor and arity n;
            for i := 1 to n do begin
                Append unify(X_i, Y_i, F, V_sf) to Code
            end;
            return
        end else if (nonvar(X) and var(Y)) then Swap X and Y
        else if (var(X) and nonvar(Y)) then Do nothing;

        if (X ∉ V_sf) then return unify_write(X, Y, F, V_sf)
        else begin /* At this point X ∈ V_sf */
            if (F implies nonvar(X)) then return unify_read(X, Y, F, V_sf)
            else if (F implies var(X)) then return unify_write(X, Y, F, V_sf)
            else begin
                Compile a three-way conditional branch comparing the tags of X and Y;
                Call unify_read and unify_write to compile the read and write mode branches
            end
        end
    end;

Figure 5.4 - The unification compiler: the main routine

3.3.2.1. Optimal write mode unification

The algorithm is modified to build a compound term in write mode with the least number of move instructions. First the code for building the main functor with empty slots for its arguments is generated. This is followed by the code for building the arguments and filling in the slots with the correct heap offsets. This technique was proposed as an optimization over the WAM by André Mariën [44]. The examples of unification given later use this technique. The justification of the BAM instructions needed for unification was done with this technique (Chapter 3).
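The layout strategy of optimal write mode unification can be sketched in a few lines of Python. This is only an illustration under stated assumptions: terms are tuples `(functor, arg1, ...)`, the tag names `fun` and `str` and the base name `h0` are made up for the sketch, and alignment padding is ignored. It is not the compiler's actual code generator. Because every heap offset is known at compile time, each heap cell is written exactly once.

```python
def size(term):
    """Number of heap cells needed for a term (atomic values need none of their own)."""
    if not isinstance(term, tuple):
        return 0
    return len(term) + sum(size(arg) for arg in term[1:])

def emit(term, base, code):
    """Append one push per heap cell of `term`, laid out starting at offset `base`."""
    functor, args = term[0], term[1:]
    code.append(f"push(fun^({functor}/{len(args)}))")    # main functor first
    next_free = base + 1 + len(args)                     # first cell after the argument slots
    nested = []
    for arg in args:
        if isinstance(arg, tuple):
            # The slot receives a pointer to an offset that is already known.
            code.append(f"push(str^(h0+{next_free}))")
            nested.append((arg, next_free))
            next_free += size(arg)
        else:
            code.append(f"push({arg})")
    for sub, offset in nested:                           # then build the nested arguments
        emit(sub, offset, code)
    return code

code = emit(("s", "a", ("f", "b", "c")), 0, [])
```

For the term s(a,f(b,c)) the sketch emits six pushes, one per heap cell, with the slot for f(b,c) filled by a pointer to offset 3.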

    function unify_write(X, Y : term; F : formula; V_sf : set) return (Code : list; F_out : formula; V_sf,out : set);
    begin
        /* At this point X is an unbound variable */
        Generate a block of instructions to create the term Y on the heap;
        Bind X to this block (i.e. generate code to dereference X if necessary,
        store a pointer to this block in X, and trail X if necessary)
    end;

    function unify_read(X, Y : term; F : formula; V_sf : set) return (Code : list; F_out : formula; V_sf,out : set);
    begin
        /* At this point Y is a nonvariable and F implies nonvar(X) */
        Code := ( );
        Compile a check that X contains a structure of same functor and arity as Y;
        for i := 1 to arity(Y) do begin
            Append unify(X_i, Y_i, F, V_sf) to Code
        end
    end;

Figure 5.5 - The unification compiler: read and write mode unification

3.3.2.2. Last argument optimization

This is an important optimization that significantly reduces the code size. It can be performed whenever a compound term has a compound term in its last argument. Without this optimization, the tree generated by the algorithm has the same depth as the term that is compiled. For each level in the tree a new block of write mode code is generated. For lists of n elements this results in O(n^2) move instructions. The optimization reduces the code size to O(n) by creating only a single write mode block, and letting all depths of the tree jump into it. This optimization was proposed by Mats Carlsson [14]. The code for write mode unification of a nested term is replaced by a single jump instruction to the write mode code block of the outermost term. An example of unification given below uses this optimization.
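The quadratic versus linear growth described above can be checked with a small counting sketch. It assumes, purely for illustration, a unit cost of one move instruction per list cell and ignores the jump instructions themselves. The function names are hypothetical.

```python
def moves_without_optimization(n):
    """One write-mode block per nesting level; the block at level k
    builds the n - k list cells that remain at that level."""
    return sum(n - k for k in range(n))

def moves_with_optimization(n):
    """A single shared write-mode block builds all n cells;
    deeper levels jump into the middle of it."""
    return n

quadratic = moves_without_optimization(10)
linear = moves_with_optimization(10)
```

Under these assumptions a list of 10 elements costs n(n+1)/2 = 55 units without the optimization and 10 with it.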

3.3.2.3. Type propagation

There are two ways in which propagating type information during the compilation of unification improves the code. First, during the unification, the algorithm keeps track of the variables that are ground, uninitialized, and recursively dereferenced. This information is propagated into the arguments of compound terms. The propagation of ground and recursively dereferenced types was added after measurements of the dataflow analyzer showed that these types are numerous.

Second, when a new variable is encountered in a term, then the unification compiler has the choice whether to create it as an initialized variable or as an uninitialized variable. It is not always best to create new variables as uninitialized, since this often makes it impossible to apply last call optimization. To solve this problem it is necessary to look ahead in the clause. The variable is created as uninitialized only if there is a goal later in the clause with this variable in an argument position that must be uninitialized.
3.3.2.4. Depth limiting

Because the unification compiler generates a separate read and write mode branch for each functor in the term that is unified, deeply nested terms result in a code size explosion. The last argument optimization (see above) reduces the code size when the nesting occurs in the last argument. For other cases, a different technique is necessary. The unification compiler replaces a deeply nested subterm by a variable, creates the subterm with write mode unification and does a general unification with the variable. The depth limit is set by the compiler option depth_limit(N), and the default depth is N=2. For example, consider the following unification where the complicated term z(...) is nested deeply:

    X=s(t(u(...z(...)...)))

It is replaced by a sequence of three unifications:

    X=s(t(u(...A...))), B=z(...), A=B

The variable B does not yet have a value, so the unification B=z(...) is executed in write mode. A general unification is performed for A=B. Since the size of a write mode unification is linear in the size of the compound term, this considerably shortens the code for deeply nested terms. Measurements were done to determine the effect of this transformation on execution time. In most cases it is insignificant, e.g. for the nand benchmark (Chapter 7), a program that contains deeply nested structures, the difference in execution time between depth limits of two and three is insignificant (i.e. only a few cycles out of several hundred thousand).
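The depth limiting rewrite can be sketched as follows. This is a minimal illustration, not the compiler's implementation: terms are Python tuples `(functor, arg1, ...)`, variables and atoms are plain strings, the fresh variable names `V1`, `V2`, ... are invented for the sketch, and counting the outermost functor as depth 1 is an assumption.

```python
import itertools

def limit_depth(lhs, term, limit, fresh=None):
    """Rewrite lhs = term into a list of unifications (lhs, rhs) in which
    no right-hand side is nested more than `limit` functors deep."""
    if fresh is None:
        fresh = (f"V{i}" for i in itertools.count(1))
    tail = []  # unifications generated for the subterms that were cut out

    def cut(t, depth):
        if not isinstance(t, tuple):
            return t                                      # variable or atom: keep as is
        if depth > limit:
            a, b = next(fresh), next(fresh)
            tail.extend(limit_depth(b, t, limit, fresh))  # B = z(...), done in write mode
            tail.append((a, b))                           # A = B, a general unification
            return a
        return (t[0],) + tuple(cut(arg, depth + 1) for arg in t[1:])

    return [(lhs, cut(term, 1))] + tail

goals = limit_depth("X", ("s", ("t", ("u", ("z", "c")))), 3)
```

With a limit of 3, the unification X=s(t(u(z(c)))) becomes X=s(t(u(V1))), V2=z(c), V1=V2, which has the same shape as the three unifications shown above.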

3.3.3. Examples of unification

Consider the following sample clause:

    a(A, s(A,[X|X])).

The WAM code for this clause is (assuming the two arguments of the clause are in registers r(0) and r(1)):

    procedure a/2               ;; the clause has two arguments.
    get_structure s/2,r(1)      ;; unify r(1) with s(A,[X|X]).
    unify_value r(0)            ;; unify the first argument with r(0).
    unify_variable r(3)         ;; load the second argument into r(3).
    get_list r(3)               ;; unify r(3) with [X|X].
    unify_variable r(2)         ;; load the first argument into r(2).
    unify_value r(2)            ;; unify the second argument with r(2).
    proceed                     ;; return to caller.

Temporary values are stored in registers r(2) and r(3). The execution time of this code averaged over read and write mode is 63 cycles on the Xenologic X-1 processor [85], an implementation of the PLM architecture [28]. The BAM code generated for the same clause is (the pragmas have been left out for clarity):


    procedure(a/2).
    deref(r(1),r(1)).                            ;; dereference r(1).
    switch(tstr,r(1),l(a/2,3),l(a/2,4),fail).    ;; three-way branch.

    label(l(a/2,3)).                             ;; write mode for s(A,[X|X]).
    trail(r(1)).                                 ;; conditionally push r(1) on trail stack.
    move(tstr^h,[r(1)]).                         ;; bind s(A,[X|X]) to second argument.
    push(tatm^(s/2),h,1).                        ;; create the term s(A,[X|X]).
    push(r(0),h,1).
    push(tlst^(h+2),h,1).
    pad(1).

    label(l(a/2,1)).                             ;; common code for last arg. opt.
    move(tvar^h,r(2)).                           ;; create the two arguments of [X|X].
    push(r(2),h,1).
    push(r(2),h,1).
    return.

    label(l(a/2,4)).                             ;; read mode for s(A,[X|X]).
    equal([r(1)],tatm^(s/2),fail).               ;; check functor & arity of s/2.
    move([r(1)+1],r(3)).                         ;; load first argument into r(3).
    deref(r(3),r(3)).
    deref(r(0),r(0)).
    unify(r(3),r(0),?,?,fail).                   ;; unify first argument with r(0).
    move([r(1)+2],r(0)).                         ;; load second argument into r(0).
    deref(r(0),r(0)).
    switch(tlst,r(0),l(a/2,6),l(a/2,7),fail).    ;; three-way branch.

    label(l(a/2,6)).                             ;; write mode for [X|X].
    trail(r(0)).
    move(tlst^h,[r(0)]).
    jump(l(a/2,1)).                              ;; jump to common code (last arg. opt.).

    label(l(a/2,7)).                             ;; read mode for [X|X].
    move([r(0)],r(2)).
    move([r(0)+1],r(0)).
    deref(r(0),r(0)).
    deref(r(2),r(2)).
    unify(r(0),r(2),?,?,fail).                   ;; unify arguments of [X|X].
    return.


Again, the two arguments of the clause are in registers r(0) and r(1) and temporary values are stored in registers r(2) and r(3). To reduce the code size, the write mode code for [X|X] jumps into the middle of the code for s(A,[X|X]). With this optimization the code is 29 BAM instructions long (after translation and instruction reordering, this is 264 bytes on the VLSI-BAM). The WAM code is only 7 instructions long (17 bytes on the PLM) because each instruction encapsulates a choice. WAM instructions for unification assume the existence of a read/write mode bit in the implementation, which collapses the execution tree onto itself.

The code size ratio VLSI-BAM/PLM is large for this example. It was hoped during development that (1) code expansion would be less for other kinds of Prolog code (e.g. calls, parameter passing, backtracking), and (2) dataflow analysis would reduce the complexity of unifications. These intuitions have been borne out (Chapter 7): the static code size in VLSI-BAM bytes measured for large programs is only three times that of the PLM, a microcoded WAM with a byte-coded instruction set.

The execution time of the above code on the VLSI-BAM is 25 cycles (measured with a simulator taking pipeline delays into account and averaged over read and write mode). This is about 40% of the cycles needed for the X-1. This time can be estimated by taking the average execution times of BAM instructions when translated to the VLSI-BAM architecture: unify takes 5 cycles, equal takes 3 cycles, switch, deref, trail, and move from memory take 2 cycles each, push, adda, and all other move instructions take 1 cycle each, and pad instructions take 0 cycles because they are collapsed into the pushes. These estimates are only approximately correct because of instruction reordering optimizations performed on VLSI-BAM code.
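The estimation rule can be written down as a small lookup table. The sketch below is illustrative only: the cycle counts are the averages quoted above, the key `move_from_memory` is a name invented here to separate loads from other moves, and return is left out because the text gives no cost for it. The sample path is a hypothetical straight-line sequence, not a measured trace.

```python
# Average cycles per BAM instruction on the VLSI-BAM, as quoted in the text.
CYCLES = {
    "unify": 5, "equal": 3,
    "switch": 2, "deref": 2, "trail": 2, "move_from_memory": 2,
    "push": 1, "adda": 1, "move": 1,
    "pad": 0,   # collapsed into the neighbouring pushes
}

def estimate_cycles(ops):
    """Sum the average cost of each instruction on one execution path."""
    return sum(CYCLES[op] for op in ops)

# Hypothetical path: dereference, branch, trail, bind, then eight more cells.
write_path = ["deref", "switch", "trail", "move",
              "push", "push", "push", "pad",
              "move", "push", "push"]
estimate = estimate_cycles(write_path)
```

Such a sum is only a rough guide, since instruction reordering on the VLSI-BAM changes the actual cycle count.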

Through programmer annotation or dataflow analysis it is sometimes possible to know the type of an argument at compile-time. For example, sometimes it is known whether an argument is unbound or bound. Consider the same sample clause again:

    a(A, s(A,[X|X])).

Assume it is known that the second argument is an uninitialized memory variable. This is expressed with the following type declaration:

    mode((a(A,B) :- uninit_mem(B))).

With this type the clause's code is only 9 BAM instructions long (36 bytes on the VLSI-BAM):

    procedure(a/2).
    move(tstr^h,[r(1)]).        ;; bind s(A,[X|X]) to second argument.
    push(tatm^(s/2),h,1).       ;; create the term s(A,[X|X]).
    push(r(0),h,1).
    push(tlst^(h+2),h,1).
    pad(1).
    move(tvar^h,r(0)).          ;; create the two arguments of [X|X].
    push(r(0),h,1).
    push(r(0),h,1).
    return.                     ;; return to caller.

The execution time of this example is 11 cycles.

3.4. Entry specialization


For each goal in the clause, the clause compiler attempts to replace it with a faster entry point, depending on the types existing at that point. For example, if it is known that the arguments N and A of the predicate functor(X,N,A) are atomic then a faster version can be compiled.

Entry specialization is done in both the clause compiler and the dataflow analysis. Doing it in both places is complementary since the analysis only keeps track of a limited set of types: ground, nonvariable, uninitialized, and recursively dereferenced. During clause compilation more information is known. For example, if the goal X<Y occurs in a clause, then afterwards it is known that X<Y is true. Analysis does not have a representation for this information, but it could be useful for entry specialization.


Figure 5.6 - Example of a modal entry tree for entry specialization (root decision node: atomic(A)?)


Entry specialization can be done for any predicate whose definition is not in the program. The system has implemented this for the built-in predicates, but it can be used by the programmer for any library predicate. For each predicate that has faster entry points, a modal_entry declaration is given, along with type declarations for the fast entry points. These declarations are used in the dataflow analysis and the clause compiler to replace any call to the predicate with a faster entry point. For example, here is the modal entry declaration for the name(A,B) built-in predicate:


    modal_entry(name(A,B),
        mode(atomic(A),
            mode(uninit(B),
                entry('$ name > 1 2'(A,B)),
                entry('$ name > 1'(A,B))
            ),
            mode(uninit(A),
                entry('$ name < 1'(A,B)),
                mode(uninit(B),
                    entry('$ name > 2'(A,B)),
                    entry(name(A,B))
                )
            ))).

This declaration defines a binary tree, depicted in Figure 5.6. The nodes of the tree are decision points containing a type. If the type is valid then the left subtree is chosen, otherwise the right subtree is chosen. The leaves of the tree are the entry points. If none of the types are valid then the rightmost leaf is chosen, which usually is the same predicate as the original one. Each of the four fast entry points also has a type declaration:


    mode('$ name > 1'(A,B),   (deref(A),deref(B)),  atomic(A),
                              (list(B),ground(B)), n).
    mode('$ name < 1'(A,B),   (uninit(A),deref(B)), true,
                              (atomic(A),deref(A),list(B),ground(B)), n).
    mode('$ name > 2'(A,B),   (deref(A),uninit(B)), true,
                              (atomic(A),list(B),ground(B),rderef(B)), n).
    mode('$ name > 1 2'(A,B), (deref(A),uninit(B)), atomic(A),
                              (list(B),ground(B),rderef(B)), n).

These declarations are written in a five-argument form that is more general than a standard type declaration (Appendix A): it gives the entry types (both Require and Before) and the exit (After) types for the predicate.
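Selecting an entry point from a modal entry tree amounts to a simple walk from the root to a leaf. The sketch below assumes a tuple encoding of the tree and treats "the type is valid" as membership in a set of known types, which is a simplification of checking against the type formula. The leaf labels `fast_1` to `fast_4` are placeholders invented here, not the real entry point names.

```python
def select_entry(tree, known_types):
    """Walk a modal entry tree: take the first subtree when the node's type
    is known to hold, otherwise the second, until an entry leaf is reached."""
    while tree[0] == "mode":
        _, node_type, if_valid, otherwise = tree
        tree = if_valid if node_type in known_types else otherwise
    return tree[1]

# Same shape as the declaration for name(A,B); leaf names are placeholders.
NAME_TREE = (
    "mode", "atomic(A)",
    ("mode", "uninit(B)", ("entry", "fast_1"), ("entry", "fast_2")),
    ("mode", "uninit(A)",
        ("entry", "fast_3"),
        ("mode", "uninit(B)", ("entry", "fast_4"), ("entry", "name(A,B)"))),
)

chosen = select_entry(NAME_TREE, {"atomic(A)", "uninit(B)"})
fallback = select_entry(NAME_TREE, set())
```

When no type is known, every decision takes the second branch and the walk ends at the original predicate name(A,B).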


3.5.  The  write-once  transformation 

In the BAM all unbound variables are kept on the heap. This makes trail checking significantly faster. However, when combined with the ability to destructively modify the value of permanent variables (e.g. to dereference them and save the dereferenced value in the permanent) it leads to several problems. These problems are all neatly resolved by the write-once transformation.

Putting all unbound variables on the heap means that there are no pointers to the environment/choice point stack; all pointers point to the heap. This reduces trail checking to a single comparison with the heap backtrack pointer r(hb) and a conditional push to the trail stack. It is not necessary to do another comparison to decide whether the variable is on the heap or in an environment. In addition, since all unbound variables are created on the heap there are no “unsafe variables” as in the WAM. An unsafe variable is an unbound variable that is created on the environment and that must be moved to the heap (“globalized”) before last call optimization deallocates its memory.
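The single-comparison trail check can be illustrated with a toy heap. The sketch makes several assumptions that are not taken from the text: the heap is a Python list that grows upward, so cells below `hb` are older than the most recent choice point, and an unbound variable is represented as a `tvar` cell that points to itself.

```python
def bind(heap, trail, addr, value, hb):
    """Bind the heap cell at `addr`, trailing it only if it predates
    the most recent choice point (one comparison, one conditional push)."""
    if addr < hb:
        trail.append(addr)
    heap[addr] = value

def undo(heap, trail, mark):
    """On backtracking, reset every trailed cell above `mark` to unbound."""
    while len(trail) > mark:
        addr = trail.pop()
        heap[addr] = ("tvar", addr)

heap = [("tvar", 0), ("tvar", 1)]
trail = []
hb = 1                      # choice point created when the heap held one cell
bind(heap, trail, 0, ("tatm", "a"), hb)   # older than the choice point: trailed
bind(heap, trail, 1, ("tatm", "b"), hb)   # newer: no trail entry needed
trailed = list(trail)
undo(heap, trail, 0)
```

Only cell 0 is trailed; cell 1 needs no entry because backtracking discards everything above `hb` anyway.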

Modifying the value of a permanent variable (e.g. by dereferencing or binding it) cannot be done without a trail operation. Indeed, consider the case where a permanent dereferences to a nonvariable term. If the dereferenced value overwrites the original value, then both the original value and its address have to be trailed since backtracking has to restore the original value. This is expensive, since it has to be done every time a permanent is bound or dereferenced.

One solution to this problem is never to store a dereferenced permanent back in the environment. This solves the problem but it is inefficient since a permanent may have to be dereferenced several times in a clause.

A better solution is to allocate a new permanent on the environment whenever the value of an old one needs to be changed. The new permanent gets the new value and the old permanent is unchanged. As a result, all permanent variables are only given values once, so they are called “write-once” permanents. Because it is not changed, the old permanent does not have to be trailed. At the cost of a slightly bigger environment, this completely eliminates the need to trail permanent variables. This allocation scheme is implemented in the clause compiler.

To summarize:

(1) All unbound variables are created on the heap, and unbound permanent variables in an environment always point to the heap.

(2) The trail check is a single comparison with r(hb) and a conditional push to the trail stack (2 cycles on the VLSI-BAM).

(3) Permanent variables are only given a single value in a clause. Whenever a permanent would be changed, a new one is allocated and given the modified value.

(4) Register allocation must allocate a different permanent register for each permanent variable in the clause. It is not allowed to use the same register for two variables whose lifetimes do not overlap.

This solution is implemented in the clause compiler by mapping a permanent variable onto a new variable whenever its value would change. The register allocator treats the new variables just like any other, and allocates them to temporary or permanent registers.

The main disadvantage of this technique is that environments are larger. For example, consider a clause of the form:

    p(A,E) :- a(A,B), b(B,C), c(C,D), d(D,E).

where variables are chained from one predicate to the next. In the WAM, it is allowed to allocate permanent variables such that variables whose lifetimes do not overlap are allocated to the same permanent register. For the above example, this requires just two permanent registers, so the total environment size is four words (it also includes registers r(e) and r(cp)). Only two permanents are needed no matter how long the chain of body goals is. This method requires trailing of the permanent's values, because backtracking must see the original values. This scheme is consistent with the original implementation of the WAM, i.e. binding permanent variables on the environment and globalizing unsafe variables to ensure correctness.

In contrast, the number of permanent variables needed by the write-once technique increases linearly with the length of the chain. For the above example, this requires four permanent variables, so the total environment size is six words. The total memory usage is increased by less than this amount because no trailing of permanents is needed.
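The two environment sizes quoted for the chained clause can be reproduced with a simplified lifetime model. The model is an assumption made for this sketch: the head is grouped with the first body goal, a variable is permanent if it must survive at least one call, and a register can be reused as soon as its variable has been passed to its last goal. Function and variable names are hypothetical.

```python
def permanents(head_vars, body_goals):
    """Map each permanent variable to the range of calls it must survive."""
    first, last = {}, {}
    for v in head_vars:                 # head counts as part of the first goal
        first[v] = last[v] = 0
    for i, goal_vars in enumerate(body_goals):
        for v in goal_vars:
            first.setdefault(v, i)
            last[v] = i
    # v is live across the calls of goals first[v] .. last[v] - 1
    return {v: range(first[v], last[v]) for v in first if first[v] < last[v]}

def register_counts(head_vars, body_goals):
    perms = permanents(head_vars, body_goals)
    write_once = len(perms)             # one register per permanent variable
    shared = max((sum(1 for live in perms.values() if call in live)
                  for call in range(len(body_goals))), default=0)
    return write_once, shared           # shared = registers needed with reuse

chain = [["A", "B"], ["B", "C"], ["C", "D"], ["D", "E"]]
write_once, shared = register_counts(["A", "E"], chain)
env_sizes = (write_once + 2, shared + 2)   # plus r(e) and r(cp)
```

For the chain above the sketch gives four write-once permanents against two shared registers, that is, environments of six and four words.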

This is an example of a trade-off between memory space and execution time. The extra memory space needed is comparable to the increased size of the trail stack if there is no trail check for permanent variables. Since this is small, I have opted to decrease execution time at the expense of larger environments. By keeping all unbound variables on the heap and by implementing permanent variables as write-once variables, permanent variables can be dereferenced and bound without trailing, and the cost of trailing heap variables is reduced to a single comparison and conditional push.

3.6.  The  dereference  chain  transformation 

This transformation is needed to maintain consistency between the dataflow analysis and the clause compiler. A new unbound variable (of either initialized type or uninitialized memory type) is created as a pointer to a memory location. Binding the variable stores the new value in the location. However, the register(s) that originally contained the unbound variable still have pointers to the location. One level of indirection is needed to access the value.


Figure 5.7 - The need for the dereference chain transformation (two panels showing argument A just before and just after the call to a(A))


To see why this is necessary and what it implies, consider the execution of the clause main (Figure 5.7):

    main :- (i) a(A), (ii) write(A).
    a(A) :- A=s(t(a),u(b),v(c)).

The relevant situation can be seen in the transition from (i) (just before the call to a(A)) to (ii) (just after the call to a(A)). At (i) a new unbound variable A is created on the heap. At (ii) the variable A has been bound to a value. The important point is that A still has a tvar tag, and that one indirection is needed to access the tstr pointer. The extra link exists because the creation of A and its binding are done in separate steps. This is true for both initialized unbound variables and uninitialized memory variables.
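The extra link can be made concrete with a toy heap model. The representation is an assumption of this sketch, not the BAM's actual data layout: heap words are `(tag, value)` pairs, an unbound variable is a `tvar` cell pointing to itself, and the cell standing for the structure is a stand-in.

```python
def deref(heap, word):
    """Follow tvar pointers until a non-variable or an unbound cell is reached.
    Returns the final word and the number of indirections that were needed."""
    steps = 0
    while word[0] == "tvar":
        target = heap[word[1]]
        if target == word:      # self-reference: still unbound
            break
        word = target
        steps += 1
    return word, steps

heap = []
# (i) just before the call: a new unbound variable A is created on the heap.
heap.append(("tvar", 0))
reg_a = ("tvar", 0)             # the register holds a pointer to the cell
before = deref(heap, reg_a)

# The call a(A) builds the structure and stores a tstr pointer in the cell.
heap.append(("tatm", "s/3"))    # stand-in for the functor cell of s(...)
heap[0] = ("tstr", 1)

# (ii) just after the call: the register still carries the tvar tag.
after = deref(heap, reg_a)
```

After the call the register still holds the `tvar` pointer, so exactly one indirection is needed to reach the `tstr` word.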


This situation is not a problem unless dataflow analysis determines that A is returned as a dereferenced value. In that case there is a conflict between what the analysis deduces and what the clause compiler thinks is true. There are two ways to solve this problem: either weaken the analysis so that it will not deduce a dereference type in this case, or modify the clause compiler to ensure that the variable is dereferenced by doing an extra indirection whenever the variable is accessed after it is bound. The compiler implements the second solution since dereferencing is a time-consuming operation and it is important to derive as many dereference types as possible. The trade-off between doing an extra indirection for a value that may not be accessed later and doing an extra dereference loop seemed to be a fair one.

The compiler inserts code to do this indirection whenever the variable is accessed after it is bound. In addition to maintaining consistency with the analysis, this speeds up later dereferencing. There is a minor interaction with the register allocator: for correctness, variables that get an extra indirection are not allowed to be pref pairs.


Chapter  6 

BAM  Transformations 


1.  Introduction 

After compiling the program from kernel Prolog into BAM code, a series of optimizing transformations is performed. The transformations performed are: (1) duplicate code elimination, (2) dead code elimination, (3) jump elimination, (4) label elimination, (5) synonym optimization, (6) peephole optimization, and (7) determinism optimization. This chapter first gives two definitions and then presents the transformations.

2. Definitions

The following two definitions are useful:

    Definition DB: A distant branch is a branch that always transfers control to an instruction other than the next in the instruction stream.

According to this definition, there are exactly four distant branches in the BAM: fail, return, jump, and switch. All other branches do not satisfy the definition since they can fall through to the next instruction.

    Definition BB: A contiguous block is any sequence of instructions that terminates with a distant branch.

According to this definition, a contiguous block can start with any instruction and can contain conditional branches with a fall through case. Therefore the code contains a large number of overlapping contiguous blocks. This is useful to get maximum optimization when looking for contiguous blocks that satisfy some property. The individual transformations mentioned in this chapter will usually only look at contiguous blocks satisfying certain constraints, for example, the contiguous blocks that begin with a label.

3.  The  transformations 

Seven transformations (Figure 6.1) are done on the BAM code generated for each predicate by the
kernel to BAM compilation stage. A transitive closure is performed on the sequence of seven transformations,
i.e. they are applied repeatedly until there are no more changes. Each transformation is carefully
coded to result in code that is better (i.e. faster or shorter) than its input, so the closure operation
terminates.
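The closure over the sequence of transformations amounts to a fixpoint loop. The sketch below is a minimal illustration under assumed names (`run_to_fixpoint` and the two toy passes are hypothetical); it terminates for the same reason given above, since each toy pass can only shorten the code.

```python
def run_to_fixpoint(code, passes):
    """Apply every pass in order, repeating the whole sequence until nothing changes."""
    while True:
        before = code
        for transform in passes:
            code = transform(code)
        if code == before:
            return code

# Two toy passes over a list of opcode strings (illustrative only).
def drop_nops(code):
    return [op for op in code if op != "nop"]

def merge_double_return(code):
    out = []
    for op in code:
        if op == "return" and out and out[-1] == "return":
            continue  # a return directly after a return can never execute
        out.append(op)
    return out

result = run_to_fixpoint(["move", "return", "nop", "return"],
                         [merge_double_return, drop_nops])
```

The second `return` only becomes adjacent to the first after `drop_nops` has run, so a single sweep over the passes would miss it; the repeated application catches it.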


Figure  6.1  -  BAM  Transformations 


3.1.  Duplicate  code  elimination 

All duplicate contiguous blocks except the last occurrence are replaced by a jump to the last one.
This optimization is also known as cross-jumping. It tightens up loose code generated by the type
enrichment transformation (Chapter 4). It is implemented by first creating a table indexed by all contiguous
blocks that (1) begin with a label, (2) do not contain any other labels (but they are allowed to contain
branches), and (3) are not degenerate blocks that consist of only a single jump, return, or fail instruction
(but a single switch is allowed). The table contains the label of the last occurrence of the block. All
contiguous blocks in the code, including those that do not begin with labels, are looked up in the table and
replaced by jumps if they are not the last occurrence. The result of this optimization is to reduce code size
at the price of slightly slowing down execution.
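A minimal sketch of the table-driven scheme just described, assuming a hypothetical tuple encoding of BAM instructions; `block_at` and `cross_jump` are illustrative names, not the compiler's own.

```python
DISTANT = {"fail", "return", "jump", "switch"}

def block_at(code, i):
    """Instructions from i up to and including the first distant branch.

    Returns None if a label is met first, since table blocks may not
    contain other labels, or if no distant branch follows.
    """
    for j in range(i, len(code)):
        if code[j][0] == "label":
            return None
        if code[j][0] in DISTANT:
            return tuple(code[i:j + 1])
    return None

def cross_jump(code):
    # Pass 1: map each block body to the label and start index of its
    # last labelled occurrence (later occurrences overwrite earlier ones).
    table = {}
    for i, ins in enumerate(code):
        if ins[0] == "label":
            body = block_at(code, i + 1)
            if body is None:
                continue
            degenerate = len(body) == 1 and body[0][0] != "switch"
            if not degenerate:
                table[body] = (ins[1], i + 1)
    # Pass 2: replace every other occurrence by a jump to that label.
    out, i = [], 0
    while i < len(code):
        body = None if code[i][0] == "label" else block_at(code, i)
        if body in table and table[body][1] != i:
            out.append(("jump", table[body][0]))
            i += len(body)
        else:
            out.append(code[i])
            i += 1
    return out

code = [
    ("move", "r(0)", "r(1)"), ("return",),                  # unlabelled copy
    ("label", "L1"), ("move", "r(0)", "r(1)"), ("return",),
    ("label", "L2"), ("move", "r(0)", "r(1)"), ("return",),  # last occurrence
]
optimized = cross_jump(code)
```

Both earlier copies collapse to `jump(L2)`, trading two instructions each for one extra jump at run time.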

3.2.  Dead  code  elimination 

All code that is not reachable from the entry point of a predicate is removed. This is done in two
steps: First, all the labels that are reachable through any number of branches are calculated by doing a
transitive closure. Second, a linear traversal of the code is done and the instructions following a distant branch
up to the next reachable label are eliminated.
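The two steps can be sketched as follows. The instruction encoding is an assumption made for illustration: any operand that names a label is treated as a branch destination, and the entry point is taken to be the first instruction.

```python
DISTANT = {"fail", "return", "jump", "switch"}

def eliminate_dead_code(code):
    """Remove instructions that cannot be reached from the entry point (index 0)."""
    position = {ins[1]: i for i, ins in enumerate(code) if ins[0] == "label"}

    # Step 1: transitive closure over branch destinations.
    reachable, visited, work = set(), set(), [0]
    while work:
        i = work.pop()
        while i < len(code) and i not in visited:
            visited.add(i)
            if code[i][0] != "label":
                for operand in code[i][1:]:
                    if operand in position and operand not in reachable:
                        reachable.add(operand)
                        work.append(position[operand])
            if code[i][0] in DISTANT:
                break
            i += 1

    # Step 2: linear sweep, dropping everything between a distant branch
    # and the next reachable label.
    out, dead = [], False
    for ins in code:
        if ins[0] == "label" and ins[1] in reachable:
            dead = False
        if not dead:
            out.append(ins)
            dead = ins[0] in DISTANT
    return out

code = [
    ("test", "ne", "tvar", "r(0)", "L2"),
    ("jump", "L1"),
    ("move", "a", "r(1)"),      # follows a distant branch: dead
    ("label", "L3"),            # never a destination: dead
    ("return",),
    ("label", "L1"),
    ("return",),
    ("label", "L2"),
    ("fail",),
]
live = eliminate_dead_code(code)
```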


3.3.  Jump  elimination

Rearrange  contiguous  blocks  to  minimize  the  number  of  jump,  call,  and  return  instructions.  This 
optimization  is  a  variant  of  the  jump  chaining  optimization.  A  transitive  closure  is  done  on  the  following 
replacements:

(1)  Replace  a  jump  by  the  contiguous  block  it  points  to  if  the  block  is  only  pointed  to  by  one  branch  or  if 
the  block  is  shorter  than  a  preset  threshold.  The  threshold  can  be  changed  by  a  compiler  directive. 
The  replacement  is  not  done  if  the  block  is  part  of  write  mode  unification  or  unification  with  an  atom, 
since  these  two  cases  are  hurt  by  the  transformation. 

(2)  Replace a call to a dummy predicate by the code for the predicate if it is straight-line code, i.e. its
code consists only of non-branches, call instructions, and branches all of whose destinations are
fail. The predicate's code must be terminated by a return or fail instruction.

(3)  Replace  a  conditional  branch  to  a  conditional  branch  by  a  new  conditional  branch  if  possible.  The 
only  case  currently  recognized  is: 

    test(ne,tvar,V,L).
    label(L).

    switch(Tag,V,fail,L2,L3).


which  causes  the  test  instruction  to  be  replaced  by: 


    switch(Tag,V,L1,L2,L3).
    label(L1).


(4)  Replace  a  branch  one  of  whose  destinations  is  a  jump  or  fail  instruction  by  a  new  branch  identical  to 
the  original  one  except  that  the  destination  label  is  replaced  by  the  destination  label  of  the  jump  or  by 
fail. 
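Replacement (4) is the classic retargeting step of jump chaining. The sketch below shows one way it could work on a hypothetical tuple encoding; it follows chains of jumps and guards against cycles, and it treats a `jump` whose destination resolves to `fail` as a plain `fail` instruction, which is an assumption of this sketch rather than something the text states.

```python
def retarget_branches(code):
    # Instruction that immediately follows each label.
    follows = {ins[1]: code[i + 1]
               for i, ins in enumerate(code[:-1]) if ins[0] == "label"}

    def resolve(dest):
        """Follow jump chains starting at dest; stop on fail or on a cycle."""
        seen = set()
        while dest in follows and dest not in seen:
            seen.add(dest)
            nxt = follows[dest]
            if nxt[0] == "jump":
                dest = nxt[1]
            elif nxt[0] == "fail":
                return "fail"
            else:
                break
        return dest

    out = []
    for ins in code:
        if ins[0] != "label":
            # Operands that are not labels pass through resolve() unchanged.
            ins = (ins[0],) + tuple(resolve(a) for a in ins[1:])
            if ins == ("jump", "fail"):
                ins = ("fail",)
        out.append(ins)
    return out

code = [
    ("test", "ne", "tvar", "r(0)", "L1"),
    ("return",),
    ("label", "L1"),
    ("jump", "L2"),
    ("label", "L2"),
    ("fail",),
]
threaded = retarget_branches(code)
```

After this step the labels L1 and L2 are no longer destinations, so the dead code and label elimination passes can remove them on the next iteration of the closure.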


3.4.  Label  elimination 


Remove  all  labels  that  are  not  jumped  to  by  any  branch  in  the  code.  This  is  done  in  two  steps:  First, 
the  set  of  all  destinations  of  all  branch  instructions  is  collected.  Second,  the  labels  not  in  this  set  are 
removed  from  the  code. 
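The two steps map directly onto two set operations. As before, the tuple encoding is a hypothetical stand-in for BAM instructions, with any operand that names a label counted as a destination.

```python
def eliminate_labels(code):
    labels = {ins[1] for ins in code if ins[0] == "label"}
    # Step 1: collect every destination of every branch instruction.
    used = {a for ins in code if ins[0] != "label"
              for a in ins[1:] if a in labels}
    # Step 2: drop the labels that are never a destination.
    return [ins for ins in code if ins[0] != "label" or ins[1] in used]

code = [("label", "L1"), ("jump", "L2"), ("label", "L2"), ("return",)]
cleaned = eliminate_labels(code)
```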


3.5.  Synonym  optimization 


This transformation is similar to strength reduction. It does a linear traversal of the code and
replaces every addressing mode by the cheapest addressing mode that contains the same value. For
example, if p(1) and r(0) contain the same value, then an occurrence of p(1) can be replaced by
r(0). The following cost order (from cheapest to most expensive) is used by default and is based on the
cost in the VLSI-BAM architecture:


Addressing mode   Reason for cost                                                          Overhead (cycles)

r(b)              Promotes creation of cut(r(b)), which is a no-op                         0
r(I)              Usable without overhead                                                  0
Atom              Requires ldi (load immediate) instruction                                1
Tag^X             Tagged pointer creation needs lea (load effective address) instruction   1
p(I)              Permanent variable needs ld (load) instruction                           1
[r(I)]            Indirection needs ld (load) instruction                                  1
[r(I)+N]          Offset indirect needs ld (load) instruction                              1
[p(I)]            Indirect permanent needs 2 ld (load) instructions                        2
[p(I)+N]          Offset indirect permanent needs 2 ld (load) instructions                 2
r(void)           Most expensive because it must not be changed                            -

The reason given for the cost describes the instructions necessary to implement the addressing mode
for the VLSI-BAM. More information on the instruction set of the VLSI-BAM is given in [34]. The
addressing mode r(void) is created by the register allocator. It corresponds to a void variable, i.e. a
variable that occurs only once in a clause and whose value may therefore be ignored. It is made the most
expensive because it must remain unchanged so that peephole optimization can remove the instruction
containing it.

The  synonym  optimization  is  implemented  by  maintaining  a  set  of  equivalence  classes  at  all  points  of 
the  program,  where  each  equivalence  class  is  a  set  of  addressing  modes  whose  values  are  identical.  Labels 
in  the  code  cause  the  set  of  equivalence  classes  to  be  reset  to  empty.  A  future  extension  of  this  module 
could  eliminate  this  restriction  by  following  the  labels  and  performing  a  transitive  closure,  resulting  in  a 
slight  performance  gain. 
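The equivalence-class bookkeeping can be sketched as below. This is a simplified illustration, not the module itself: the string encoding of addressing modes and the `cost` parser are assumptions, only `move` is treated as writing a value, and indirect modes that depend on an overwritten register are not invalidated.

```python
def cost(mode):
    """Rank of an addressing mode, following the default cost order in the text."""
    if mode == "r(b)":
        return 0
    if mode == "r(void)":
        return 9
    if mode.startswith("[p("):
        return 8 if "+" in mode else 7
    if mode.startswith("[r("):
        return 6 if "+" in mode else 5
    if mode.startswith("p("):
        return 4
    if mode.startswith("r("):
        return 1
    return 3 if "^" in mode else 2   # Tag^X, otherwise an atom

def synonym_optimize(code):
    classes = []   # each set holds addressing modes known to contain the same value

    def find(mode):
        for c in classes:
            if mode in c:
                return c
        return None

    def cheapest(mode):
        c = find(mode)
        return min(c, key=cost) if c else mode

    out = []
    for ins in code:
        if ins[0] == "label":
            classes.clear()              # control may arrive from elsewhere
            out.append(ins)
        elif ins[0] == "move":
            src, dst = cheapest(ins[1]), ins[2]
            for c in classes:
                c.discard(dst)           # dst is overwritten
            c = find(src)
            if c is None:
                c = {src}
                classes.append(c)
            c.add(dst)
            out.append(("move", src, dst))
        else:
            out.append((ins[0],) + tuple(cheapest(a) for a in ins[1:]))
    return out

code = [
    ("move", "p(1)", "r(0)"),
    ("add", "p(1)", "r(2)", "r(3)"),   # p(1) and r(0) are synonyms here
    ("label", "L1"),
    ("add", "p(1)", "r(2)", "r(3)"),   # classes were reset at the label
]
optimized = synonym_optimize(code)
```

The first `add` reads r(0) instead of p(1), saving a load; the second is left alone because the label discards what was known.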

3.6.  Peephole  optimization 

A transitive closure is performed on a peephole transformation with a window of three instructions.
The set of patterns was determined empirically by looking at the compiler's output and adding patterns to
fix obvious inefficiencies. Each pattern is implemented as a single clause in the optimizer. The patterns
are one, two, and three instructions long. However, the window is extended to arbitrary size for one
pattern, a generalized last call optimization:

    call(N/A).
    deallocate(I).    % Arbitrary number of deallocate instructions.

    deallocate(J).
    return.

which is transformed to:

    deallocate(I).    % Same sequence as above.

    deallocate(J).
    jump(N/A).
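A sketch of this one variable-width pattern, assuming a hypothetical tuple encoding of the instructions. In this sketch a run of zero deallocate instructions also matches, which is an assumption beyond what the pattern above shows.

```python
def last_call_optimize(code):
    out, i = [], 0
    while i < len(code):
        if code[i][0] == "call":
            j = i + 1
            while j < len(code) and code[j][0] == "deallocate":
                j += 1                              # skip the deallocate run
            if j < len(code) and code[j] == ("return",):
                out.extend(code[i + 1:j])           # keep the deallocate run
                out.append(("jump", code[i][1]))    # call + return becomes a jump
                i = j + 1
                continue
        out.append(code[i])
        i += 1
    return out

code = [("call", "p/2"), ("deallocate", "2"), ("deallocate", "1"), ("return",)]
optimized = last_call_optimize(code)
```

A call that is followed by anything other than deallocate instructions and a return is left untouched.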

3.7.  Determinism  optimization 

A choice instruction is removed if it is followed by a sequence of instructions that cannot fail and a
cut instruction. This simple-looking optimization significantly increases determinism: many predicates
(e.g. Warren's quicksort benchmark) containing a cut become deterministic that would otherwise be
compiled with a choice point.
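The rule can be sketched as a single scan. The set of instructions that cannot fail is not listed in the text, so `CANNOT_FAIL` below is a hypothetical placeholder, as is the tuple encoding.

```python
# Hypothetical: the actual set of non-failing BAM instructions is not given in the text.
CANNOT_FAIL = {"move", "deref", "allocate"}

def remove_redundant_choice(code):
    out, i = [], 0
    while i < len(code):
        if code[i][0] == "choice":
            j = i + 1
            while j < len(code) and code[j][0] in CANNOT_FAIL:
                j += 1
            if j < len(code) and code[j][0] == "cut":
                i += 1          # drop only the choice; the rest is kept as is
                continue
        out.append(code[i])
        i += 1
    return out

deterministic = remove_redundant_choice([
    ("choice", "1/2", "L4"),
    ("move", "r(1)", "[r(2)]"),
    ("cut", "r(3)"),
    ("return",),
])
kept = remove_redundant_choice([
    ("choice", "1/2", "L4"),
    ("call", "p/1"),            # may fail, so the choice point is needed
    ("cut", "r(3)"),
    ("return",),
])
```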


A similar optimization is performed by the simplification transformation of kernel Prolog (Chapter
4). For example, it transforms (!,p ; q) into (!,p). The determinism optimization extends
simplification: if the goal s compiles into instructions that cannot fail then it is able to successfully
optimize the BAM code of (s,!,p ; q) even when simplification cannot determine that s always
succeeds.

Consider this predicate, which contains no cut:

    :- mode((max(A,B,C) :- uninit(C))).    % C is unbound and unaliased.

    max(A, B, C) :- A<B, B=C.              % No cut here.
    max(A, B, C) :- A=C.

It  is  compiled  into  the  following  BAM  code  (slightly  simplified  for  readability): 


    procedure(max/3).
    deref(r(0),r(0)).
    deref(r(1),r(1)).
    jump(lts,r(0),r(1),l(max/3,1)).     % Conditional branch A<B.
    move(r(0),[r(2)]).                  % A<B is false.
    return.

    label(l(max/3,1)).
    choice(1/2,[0,2],l(max/3,4)).       % A<B is true.
    move(r(1),[r(2)]).
    return.

    label(l(max/3,4)).
    choice(2/2,[0,2],fail).
    move(r(0),[r(2)]).
    return.

When A<B is true, a choice point is created to try both clauses. If a cut is inserted into the first clause:

    :- mode((max(A,B,C) :- uninit(C))).    % C is unbound and unaliased.

    max(A, B, C) :- A<B, !, B=C.           % Cut is added here.
    max(A, B, C) :- A=C.

then the code becomes deterministic:


    procedure(max/3).
    move(b,r(3)).
    deref(r(0),r(0)).
    deref(r(1),r(1)).
    jump(lts,r(0),r(1),l(max/3,4)).     % Conditional branch A<B.
    move(r(0),[r(2)]).
    return.

    label(l(max/3,4)).
    cut(r(3)).
    move(r(1),[r(2)]).
    return.

Measurements done by Touati [70] justify this optimization. He finds that it makes about half of all choice
point operations avoidable.


Chapter  7 

Evaluation  of  the  Aquarius  system 


1.  Introduction 

This chapter attempts to quantify some of the ideas that were introduced in previous chapters. The
evaluation process is as important as any other part of the implementation of a large software system.
During the design phase it guides the design decisions. After the design is complete, it shows what features of
the design contributed most to its effectiveness and it gives a foundation for starting the next design.
Quantitative measurements are the most reliable guideposts one has during the design. For example, it is easy to
imagine many possible compiler optimizations, but most of these have an insignificant effect on
performance. It is more difficult to discover optimizations that are widely applicable.

Five evaluations are performed in this chapter:

(1)  The absolute performance of the system.

(2)  The effectiveness of the dataflow analysis.

(3)  The effectiveness of the determinism transformation.

(4)  A brief comparison with a high performance implementation of the C language.

(5)  A bug analysis, summarizing the number and types of bugs encountered during development.

Table 7.1 describes the benchmarks used in this chapter and their size in lines of code (not including
comments). The benchmarks were chosen as examples of realistic programs doing computations representative
of Prolog. This includes benchmarks that spend much of their time executing built-in predicates because
this behavior is common in real-world programs. The benchmarks are divided into two classes, small and
large, depending on whether the compiled code with analysis is smaller or larger than 1000 words. The
benchmarks log10, ops8, times10, and divide10 are grouped together and referred to as deriv because they
are closely related. The benchmarks are available by anonymous ftp to arpa.berkeley.edu.

Table 7.1 - The benchmarks

Benchmark         Lines   Description

nreverse          10      Naive reverse of a 30-element list.
tak               15      Recursive integer arithmetic.
qsort             19      Quicksort of a 50-element list.
log10             27      Symbolic differentiation.
ops8              27      Symbolic differentiation.
times10           27      Symbolic differentiation.
divide10          27      Symbolic differentiation.
serialise         29      Calculate serial numbers of a list.
queens_8          31      Solve the eight queens puzzle.
mu                33      Prove a theorem of Hofstadter's "mu-math."
zebra             36      A logical puzzle based on constraints.
sendmore          43      The SEND+MORE=MONEY puzzle.
fast_mu           54      An optimized version of the mu-math prover.
query             68      Query a static database (with integer arithmetic).
poly_10           86      Symbolically raise a polynomial to the tenth power.
crypt             64      Solve a simple cryptarithmetic puzzle.
meta_qsort        74      A meta-interpreter running qsort.
prover            81      A simple theorem prover.
browse            92      Build and query a database.
unify             125     A compiler code generator for unification.
flatten           158     Source transformation to remove disjunctions.
sdda              273     A dataflow analyzer that represents aliasing.
reducer_nowrite   298     A graph reducer based on combinators.
reducer           301     Same as above but writes its answer.
boyer             377     An extract from a Boyer-Moore theorem prover.
simple_analyzer   443     A dataflow analyzer analyzing qsort.
nand              493     A logic synthesis program based on heuristic search.
chat_parser       1138    Parse a set of English sentences.
chat              4801    Natural language query of a geographical database.

All VLSI-BAM numbers in this chapter were obtained from the VLSI-BAM instruction-level simulator
and include cache effects [17]. The simulated system has 128 KB instruction and data caches. The
caches are direct mapped and use a write-back policy. They are run in warm start: each benchmark is run
twice and the results of the first run are ignored. The cache overhead is greatest for tak compiled without
analysis, and for poly_10, simple_analyzer, chat, and boyer. For these programs it ranges from 9% to 24%.
For meta_qsort, reducer, and chat_parser the overhead ranges from 2% to 3%. For all other programs the
overhead is less than 0.5%.


2.  Absolute  performance 

This section compares the performance of Aquarius Prolog with Quintus Prolog. Tables 7.2 and 7.3
compare the performance of Quintus Prolog version 2.5 running on a Sun 4/65 (25 MHz SPARC) with that
of Aquarius Prolog running on the VLSI-BAM (30 MHz). The "Raw Speedup" column gives the ratio of
the speeds. The "Normalized Speedup" column divides this ratio by 1.8. Our group is in the process of
porting the Aquarius system to the MIPS, MC68020, and SPARC processors. It was not possible to get
numbers for these systems in time for the final version of this dissertation.


The normalization factor of 1.8 takes into account the Prolog-specific extensions of the VLSI-BAM
(a factor of 1.5) and the clock ratio (a factor of 30/25 = 1.2). The general-purpose base architecture of the
VLSI-BAM is very similar to the SPARC. The effect of the architectural extensions of the VLSI-BAM
[34] has been carefully measured to be about 1.5 for large programs. However, for the small programs the
compiler is able to remove many Prolog-specific features, so that the normalized speedup numbers in Table
7.2 are an underestimate.
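The normalization is simple arithmetic, shown here as a short sketch. The function names are illustrative; the two timings are the tak row of Table 7.2.

```python
import math

CLOCK_RATIO = 30 / 25        # VLSI-BAM clock over Sun 4/65 clock
ARCH_FACTOR = 1.5            # measured effect of the Prolog-specific extensions
NORMALIZATION = CLOCK_RATIO * ARCH_FACTOR   # the factor of 1.8 used in the tables

def speedups(quintus_ms, aquarius_ms):
    """Return (raw, normalized) speedup of Aquarius over Quintus."""
    raw = quintus_ms / aquarius_ms
    return raw, raw / NORMALIZATION

def geometric_mean(values):
    """Mean used for the summary rows of the speedup tables."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

raw, normalized = speedups(1120.0, 25.4)
print(round(raw, 1), round(normalized, 1))   # 44.1 24.5
```

Dividing by 1.8 removes the part of the raw speedup that comes from the faster clock and the special-purpose hardware, leaving an estimate of what the compiler alone contributes.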


Table 7.2 - Performance results for small programs (in ms)

Benchmark       Size      Quintus v2.5   Aquarius     Normalized   Raw
                (lines)   (Sun 4/65)     (VLSI-BAM)   Speedup      Speedup

deriv                     1.143          0.0913       7.0          12.5
  log10         27        0.153          0.0168
  ops8          27        0.239          0.0189
  times10       27        0.345          0.0257
  divide10      27        0.406          0.0299
nreverse        10        1.62           0.136        6.6          11.9
qsort           19        4.820          0.173        15.5         27.8
serialise       29        3.10           0.447        3.9          6.9
query           68        23.7           3.57         3.7          6.6
mu              33        7.04           0.808        4.8          8.7
fast_mu         54        9.08           0.932        5.4          9.7
queens_8        31        21.2           1.13         10.4         18.7
tak             15        1120.          25.4         24.5         44.1
poly_10         86        417.           35.5         6.5          11.7
sendmore        43        490.           38.4         7.1          12.8
zebra           36        423.           84.1         2.8          5.0

geometric mean                                        6.7          12.1
standard deviation of mean                            1.9          3.3

For the small benchmarks, the normalized speedup is somewhere between 6.7 and 12.1 (Table 7.2).
The normalized speedup of the large benchmarks without built-in predicates is about 5.2 (Table 7.3).
Speedup is better for the small benchmarks because dataflow analysis is able to derive better types for
many of them. For some of them (such as tak and nreverse) it derives essentially perfect types. The small
programs show a large variation in speedups. The tak benchmark does well because it relies on integer
arithmetic, which is compiled efficiently using uninitialized register types. The zebra benchmark does
poorly for two reasons. First, it does a large amount of backtracking, which is inherently limited by
memory bandwidth. Second, it works by successively instantiating arguments of a compound data
structure. The analysis algorithm does not have a representation for this operation, so it cannot be
optimized.

Table 7.3 - Performance results for large programs (in ms)

Benchmark          Size      Quintus v2.5   Aquarius     Normalized   Raw
                   (lines)   (Sun 4/65)     (VLSI-BAM)   Speedup      Speedup

No built-ins
prover             81        8.67           0.921        5.2          9.4
meta_qsort         74        49.6           4.71         5.8          10.5
nand               493       173.3          13.7         7.0          12.7
reducer_nowrite    298       312.           37.2         4.6          8.4
chat_parser        1138      1157.          129.5        5.0          8.9
browse             92        5450.          741.         4.1          7.4
geometric mean                                           5.2          9.4
standard deviation of mean                               0.5          0.8

Including built-ins
unify              125       18.3           1.40         7.2          13.0
flatten            158       13.6           1.42         5.3          9.6
sdda               273       29.5           2.94         5.6          10.0
crypt              64        21.7           4.00         3.0          5.4
simple_analyzer    443       180.           33.4         3.0          5.4
reducer            301       405.           44.9         5.0          9.0
chat               4801      3100.          699.         2.5          4.4
boyer              377       4870.          1360.        2.0          3.6
geometric mean                                           3.8          6.9
standard deviation of mean                               0.7          1.3

geometric mean (all large programs)                      4.4          7.9

Table 7.4 - Time spent in built-in predicates

Benchmark          Time (%)   Most used built-ins

prover             0          -
meta_qsort         0          -
chat_parser        0          -
nand               <1         -
browse             1          length/2
reducer            40         write/1, compare/3, arg/3
unify              40         arg/3, functor/3, compare/3
crypt              50         div/2, mod/2, */2
boyer              60         arg/3, functor/3
simple_analyzer    70         compare/3, sort/2, arg/3
sdda               70         write/1, =../2, compare/3
flatten            80         write/1, sort/2, compare/3, name/2, functor/3, arg/3


The built-in predicates in Aquarius Prolog are not greatly faster than those in Quintus Prolog, since
many of the Quintus built-ins are not written in Prolog, but in hand-crafted assembly. The Aquarius system
shows better speedup over Quintus built-ins written in Prolog (such as read/1 and write/1) and the
entry specialization transformation also speeds up the built-ins. Table 7.4 gives the percentage of time that
the benchmarks spend executing inside built-in predicates. This number does not take into account
built-ins that are implemented as in-line code (arithmetic test, addition and subtraction, and type checking). The
table also gives the most often used built-in predicates for each benchmark in decreasing order of usage.

Several benchmarks use built-in predicates significantly. The normalized speedup for these
programs is 3.8, somewhat less than programs without built-ins (Table 7.3). The normalized speedup for all
large programs is 4.4 (the reducer benchmark is counted only once in this average). The boyer benchmark
does poorly because it relies heavily on the arg/3 and functor/3 built-in predicates. The chat
benchmark uses these built-ins as well as others including setof/3, but it was not possible to measure
the fraction of execution time spent in them. The sdda and flatten benchmarks do well partly because the
write/1 built-in is much faster in Aquarius than in Quintus.

3.  The  effectiveness  of  the  dataflow  analysis 

This section evaluates the effectiveness of the dataflow analysis with three kinds of measurements.
Tables 7.5, 7.6, and 7.7 give the effect of the dataflow analyzer on performance and code size, and the
efficiency of the analyzer both in terms of its execution time and the fraction of arguments for which types
can be deduced.

For a representative set of realistic Prolog programs of various sizes up to 1,100 lines, the analyzer is
able to derive type information for 56% of all predicate arguments. It finds that on average 23% of all
predicate arguments are uninitialized, 21% of arguments are ground, 10% of arguments are nonvariables,
and 17% of arguments are recursively dereferenced. The sum of these four numbers is greater than 56%
since it is possible for an argument to have multiple types, e.g. it can be ground and recursively
dereferenced at the same time. Doing analysis reduces execution time on the VLSI-BAM by 18% for programs
without built-ins and static code size by 43% for all programs.

Table 7.5 gives the execution time in microseconds of the benchmarks for the VLSI-BAM compiled
without analysis (No Modes) and with analysis (Auto Modes). The last three columns give the ratios of the
auto modes to the no modes times. To give an idea how built-ins affect the results of analysis, Table 7.5
gives two performance ratios for the large benchmarks: the first for all programs, and the second for
programs that do not use built-ins significantly (the first five of Table 7.4). Data initialization times are
subtracted from deriv, nreverse, qsort, serialise, and prover. The table also gives the time each benchmark
spends performing dereferencing (Deref) and trailing (Trail).

Table 7.5 - The effect of dataflow analysis on performance

                    No Modes (µs)               Auto Modes (µs)             Auto/No Modes
Benchmark           Time      Deref    Trail    Time      Deref    Trail    Time   Deref  Trail

deriv               146       18.2     5.5      91.3      0.3      0.1      0.63   0.02   0.02
  log10             25.9      2.3      0.7      16.8      0        0
  ops8              28.5      3.3      1.0      18.9      0.3      0.1
  times10           39.7      5.1      1.3      25.7      0        0
  divide10          51.7      7.5      2.5      29.9      0        0
nreverse            308       79.7     31.1     136       0        0        0.44   0.00   0.00
qsort               378       109      25.1     173       0        0        0.46   0.00   0.00
serialise           512       75.8     12.3     447       44.9     0.7      0.87   0.59   0.05
mu                  992       154      48.0     783       139      34.7     0.79   0.90   0.72
fast_mu             1120      148      38.0     932       64.4     7.9      0.83   0.44   0.21
queens_8            1700      271      67.9     1090      33.4     0        0.64   0.12   0.00
query               5180      560      174      3570      0        0        0.69   0.00   0.00
tak                 71700     13800    3180     25400     0        0        0.35   0.00   0.00
poly_10             60400     6280     1740     35600     1080     209      0.59   0.17   0.12
zebra               84600     11400    8.6      84100     11400    8.4      0.99   1.00   0.98
average                                                                     0.66   0.29   0.19

prover              1070      110      29.4     820       51.2     5.9      0.76   0.47   0.20
unify               1600      198      33.9     1400      138      19.3     0.88   0.69   0.57
flatten             1460      149      9.9      1420      133      6.5      0.97   0.90   0.66
sdda                3180      368      36.9     2940      296      21.3     0.92   0.81   0.58
crypt               4090      319      104      4000      262      104      0.98   0.82   1.00
meta_qsort          5330      674      182      4450      417      63.0     0.83   0.62   0.35
nand                18700     2290     542      13400     902      22.9     0.72   0.39   0.04
simple_analyzer     35400     3880     316      31900     3080     76.2     0.90   0.79   0.24
reducer             48800     6680     1210     44900     5580     731      0.92   0.84   0.61
chat_parser         151000    19400    6990     131000    11200    4360     0.87   0.58   0.62
browse              820000    117000   28600    741000    96700    20400    0.90   0.82   0.71
boyer               1410000   73900    6340     1360000   75000    6270     0.97   1.02   0.99
average                                                                     0.89   0.73   0.55
average (no built-ins)                                                      0.82   0.58   0.39


The time spent in dereferencing and trailing, two of the most common Prolog-specific operations, is
significantly reduced by analysis. For the small benchmarks analysis reduces dereferencing from 17% to
5% of execution time, and trailing from 4% to 0.6% of execution time. This is because they are simple
enough that analysis is able to deduce most relevant modes. For the large benchmarks dereferencing is
reduced from 11% to 9% and trailing is reduced from 2.3% to 1.3%. These results are less extreme for two
reasons: the large benchmarks use built-ins, which are unaffected by analysis, and the analyzer loses
information due to its inability to handle aliasing and its limited type domain.

Table 7.6 - The effect of dataflow analysis on static code size

Benchmark          No Modes         Auto Modes       Auto/No Modes
                   (instructions)   (instructions)

tak                80               34               0.42
nreverse           287              139              0.48
queens_8           472              146              0.31
qsort              485              215              0.44
deriv              5891             1123             0.19
  log10            1464             272
  ops8             1469             277
  times10          1479             287
  divide10         1479             287
query              1425             403              0.28
serialise          860              520              0.60
mu                 1169             731              0.63
fast_mu            1165             718              0.62
zebra              1271             814              0.64
poly_10            3023             893              0.30
average                                              0.45

crypt              1239             1027             0.83
browse             1863             1150             0.62
prover             4395             1318             0.30
meta_qsort         2484             1424             0.57
flatten            4267             2335             0.55
unify              6326             4210             0.67
sdda               6526             5031             0.77
simple_analyzer    9057             5836             0.64
nand               23406            6654             0.28
reducer            11726            7682             0.66
boyer              24862            9136             0.37
chat_parser        33557            20516            0.61
average                                              0.57


Table 7.6 gives the static code size (in VLSI-BAM instructions) for the benchmarks compiled
without analysis (No Modes) and with analysis (Auto Modes). The effect of analysis on code size is
greater than the effect on performance. This follows from the compiler's implementation of argument
selection: when no modes are given, the compiler generates more code to handle arguments of different
types. If analysis derives the type then the code becomes much smaller. The code size compares favorably
with other symbolic processors, and is low enough that there is no disadvantage to having a simple
instruction set. With the analyzer, code size on the VLSI-BAM is similar to the KCM [6], about three times the
PLM, a micro-coded WAM [28], and about one fourth the SPUR using macro-expanded WAM [8].
Table 7.7 - The efficiency of dataflow analysis

                                            Modes (fraction of arguments)
Benchmark          Args   Preds   Time      uninit   ground   nonvar   rderef   any
                                  (sec)

deriv              12     8       11.9      0.33     0.67     0.00     0.67     1.00
  log10            3      2       2.9
  ops8             3      2       3.0
  times10          3      2       3.0
  divide10         3      2       2.9
tak                4      2       2.3       0.25     0.75     0.00     0.75     1.00
nreverse           5      3       2.2       0.40     0.60     0.00     0.60     1.00
qsort              7      3       3.4       0.43     0.57     0.00     0.57     1.00
query              7      5       4.2       0.86     0.14     0.00     0.14     1.00
zebra              10     6       3.5       0.10     0.00     0.50     0.00     0.60
serialise          16     7       4.2       0.38     0.19     0.06     0.19     0.63
queens_8           16     7       6.0       0.31     0.69     0.00     0.69     1.00
mu                 17     8       9.6       0.12     0.47     0.00     0.12     0.65
poly_10            27     11      16        0.33     0.67     0.00     0.67     1.00
fast_mu            35     7       21        0.29     0.55     0.05     0.55     0.89
average                                     0.35     0.48     0.06     0.45     0.89

meta_qsort         10     7       11        0.30     0.00     0.10     0.00     0.40
crypt              18     9       12        0.00     0.61     0.11     0.56     0.72
prover             22     9       13        0.27     0.09     0.27     0.14     0.68
browse             42     14      20        0.24     0.45     0.05     0.40     0.74
boyer              62     25      31        0.27     0.00     0.06     0.00     0.34
flatten            83     28      34        0.27     0.08     0.16     0.11     0.52
sdda               87     32      45        0.18     0.07     0.17     0.08     0.44
reducer            134    41      50        0.13     0.10     0.05     0.12     0.29
unify              141    29      84        0.18     0.19     0.14     0.21     0.56
nand               180    43      5900      0.26     0.67     0.00     0.28     0.93
simple_analyzer    270    71      77        0.23     0.10     0.08     0.10     0.41
chat_parser        744    156     263       0.44     0.19     0.02     0.09     0.67
average                                     0.23     0.21     0.10     0.17     0.56

Table 7.7 presents data about the efficiency of the dataflow analyzer. For each benchmark it gives
the number of predicate arguments (Args) where a predicate of arity N is counted as N, the number of
predicates (Preds), the analysis time (Time), the fraction of arguments that are uninitialized (uninit), ground
(ground), nonvariable (nonvar), or recursively dereferenced (rderef), and the fraction of arguments that
have any of these types (any). Analysis time is measured under Quintus release 2.0 on a Sun 3/60. It is
roughly proportional to the number of arguments in the program, except for the nand benchmark. The sum
of the individual modes columns is usually greater than the any modes column. This is because arguments
can have multiple modes: they can be both recursively dereferenced and ground or nonvariable.
Uninitialized arguments are present in great quantities, even in large programs such as chat_parser and
simple_analyzer. Comparing the small and large benchmarks, the fraction of derived modes decreases for
the large programs for each type except nonvariable. For both the small and large benchmarks the analyzer
transforms one third of the uninitialized modes into uninitialized register modes.

4.  The  effectiveness  of  the  determinism  transformation 

To show what parts of the determinism transformation of Chapter 4 are the most effective, it is useful
to define a spectrum of determinism extraction algorithms ranging from pure WAM to the full mechanism
of the Aquarius compiler. To do this, the Aquarius mechanism for extracting determinism is divided into
three orthogonal axes:

(1)  The  kind  of  tests  used  to  extract  determinism.  These  tests  are  separated  into  three  classes:  explicit 
unifications  (c.g.  x=a,  x-s(Y)),  arithmetic  tests  (c.g.  x<Y,  x>l),  and  type  checks  (e.g. 
var(X).  atomic  (X)).  Pure  WAM  uses  only  explicit  unifications  with  nonvariables.  Aquarius 
uses  all  three  kinds. 

<2)  Which  argumcnt(s)  are  used  to  extract  determinism.  Pure  WAM  uses  only  the  first  argument  of  a 
predicate.  Aquarius  uses  any  argument  that  it  can  determine  is  effective.  It  uses  enrichment  heuris¬ 
tic  2  (Chapter  4  section  6.2). 

(3)  Whether  the  factoring  transformation  is  performed  (Chapter  4).  Factoring  significantly  increases 
determinism  for  predicates  that  contain  many  identical  compound  terms  in  the  head.  Pure  WAM 
does  not  assume  factoring.  Aquarius  does  factoring  by  default. 
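The effect of using an arithmetic test for clause selection can be illustrated with a small Python analogy. This is only a sketch: the predicate max/3 is a hypothetical example that does not come from the benchmarks, and Python functions stand in for compiled clauses. The first version tries the clauses in order, as a system without arithmetic-test selection would do after creating a choice point. The second uses the test itself to branch directly to the only clause that can succeed.

```python
FAIL = object()  # sentinel standing in for clause failure


def max_clause_1(x, y):
    """Hypothetical clause: max(X, Y, X) :- X >= Y."""
    return x if x >= y else FAIL


def max_clause_2(x, y):
    """Hypothetical clause: max(X, Y, Y) :- X < Y."""
    return y if x < y else FAIL


def max_with_backtracking(x, y):
    """Try the clauses in order, as if a choice point had been created.

    Returns (result, number_of_clauses_tried).
    """
    tried = 0
    for clause in (max_clause_1, max_clause_2):
        tried += 1
        result = clause(x, y)
        if result is not FAIL:
            return result, tried
    raise ValueError("no clause matches")


def max_with_branch(x, y):
    """Same predicate with the arithmetic test used for selection:
    one comparison picks the only clause that can succeed."""
    if x >= y:
        return x
    return y


if __name__ == "__main__":
    print(max_with_backtracking(3, 5))  # second clause is reached after a failure
    print(max_with_branch(3, 5))        # a single conditional branch
```

In the branching version no clause is ever entered and then abandoned, which is the sense in which backtracking is converted into conditional branching.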

These three parameters define a three-dimensional space of determinism extraction algorithms. Each algorithm
is characterized by a 3-tuple depending on its position on each of the axes (Table 7.8). This results in
3 x 2 x 2 = 12 data points. Pure WAM selection corresponds to the first element in each column, denoted
by the 3-tuple (U, ONE, NF). The Aquarius compiler’s selection corresponds to the last element in each
column, denoted by the 3-tuple (UAT, ANY, F).
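The space of 3-tuples can be enumerated mechanically. The Python sketch below is illustrative only and is not part of the Aquarius compiler; the names AXES, points, and edges are chosen here for the example. It builds the 12 points and the pairs of points that differ by one step on exactly one axis.

```python
from itertools import product

# Axis values ordered from pure WAM (first) to full Aquarius (last),
# using the abbreviations U/UA/UAT, ONE/ANY, and NF/F.
AXES = (
    ("U", "UA", "UAT"),  # kind of test
    ("ONE", "ANY"),      # which argument
    ("NF", "F"),         # factoring
)


def points():
    """Return all 3-tuples of the determinism extraction space."""
    return list(product(*AXES))


def edges():
    """Return pairs (lower, upper) that differ by one step on one axis."""
    result = []
    for p in points():
        for axis, values in enumerate(AXES):
            i = values.index(p[axis])
            if i + 1 < len(values):
                upper = p[:axis] + (values[i + 1],) + p[axis + 1:]
                result.append((p, upper))
    return result


if __name__ == "__main__":
    pts = points()
    print(len(pts))         # number of data points
    print(pts[0], pts[-1])  # WAM selection and Aquarius selection
    print(len(edges()))     # number of one-step connections
```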

Table 7.8 - Three dimensions of determinism extraction

Kind of test                          Which argument               Factoring
------------------------------------  ---------------------------  ------------------
Explicit unifications only (U).       First argument only (ONE).   No factoring (NF).
Explicit unifications and arithmetic  Any argument (ANY).          Do factoring (F).
tests (UA).
Explicit unifications, arithmetic
tests, and type checks (UAT).

For each of these 12 points three parameters were measured: execution time, static code size, and compile
time. All programs are compiled with dataflow analysis and executed on the VLSI-BAM. All averages are
geometric means. It was only possible to do measurements for nine benchmarks: nreverse, qsort, query,
mu, fast_mu, queens_8, flatten, meta_qsort, and nand. Therefore the variance of the results is large and
they can be relied upon only to indicate trends. The benchmarks were written for the WAM. The measurements
compare only the relative powers of different kinds of determinism extraction in the BAM. They
do not compare the WAM and BAM directly.
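Since all averages in this section are geometric means, a short sketch of the computation may be useful. The Python fragment below is an illustration only: the function name and the sample ratios are invented for the example and are not measurements from this chapter.

```python
import math


def geometric_mean(ratios):
    """Geometric mean of positive ratios: exp of the mean of the logs."""
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


if __name__ == "__main__":
    # Hypothetical execution-time ratios (variant time / reference time)
    # for three benchmarks. These are NOT data from the dissertation.
    sample = [2.0, 0.5, 1.0]
    print(geometric_mean(sample))     # a 2x slowdown and a 2x speedup cancel
    print(sum(sample) / len(sample))  # the arithmetic mean does not cancel them
```

A useful property for benchmark ratios is that the geometric mean of the reciprocal ratios is the reciprocal of the geometric mean, so the conclusion does not depend on which system is taken as the reference.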


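[Figure 7.1: lattice of the 12 combinations of determinism extraction. Vertices are labeled with the
percent slowdown relative to Aquarius selection; edges are labeled with the difference between their two
vertices. Labels that remain legible: Aquarius selection (UAT, ANY, F) = 0; (UA, ANY, F) = 2;
WAM selection (U, ONE, NF) = 16.]

Figure 7.1 - The effectiveness of determinism extraction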


Figure 7.1 depicts the 12 points as a lattice. Each vertex denotes one particular combination of determinism
extraction. The top element corresponds to Aquarius selection and the bottom element corresponds
to WAM selection. Each edge connects two points that differ by one step in one coordinate. The vertices
are marked with the percent slowdown compared to Aquarius selection. The edges are marked with the
percent difference in execution time between their two endpoints.

The mean speedup for the nine benchmarks when going from WAM selection (U, ONE, NF) to
Aquarius selection (UAT, ANY, F) is 16%. There is no significant change in mean code size for any of the
twelve data points. The variance of the compile time is too large to draw any conclusions about it.

The mean speedup of factoring is 8%. However, factoring is the only transformation that sometimes
slows down execution. The factoring heuristic should be refined to look inside compound arguments to see
whether there is any potential determinism there. If there is none, it should not factor that argument.

One way of finding a set of effective extensions for determinism extraction is by traversing the lattice
from bottom to top, and picking the edge with the greatest performance increase at each vertex. Starting at
WAM selection (U, ONE, NF), the first extension is the ability to use arithmetic tests in selection. This
speeds up execution by 3%. The second extension is the ability to select on any argument. This speeds up
execution by another 3%. The third extension is the factoring transformation. This speeds up execution by
8%. At this point, the resulting performance is within 2% of Aquarius selection.
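The bottom-to-top traversal can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually used for the measurements. The entries marked "text" follow from the percentages quoted in this section, taking edge labels to be simple differences of vertex labels. Every other entry is a hypothetical placeholder, chosen only so that the example runs; none of those values is a measurement.

```python
# Axis values ordered from pure WAM to full Aquarius.
AXES = (("U", "UA", "UAT"), ("ONE", "ANY"), ("NF", "F"))


def upward_neighbors(vertex):
    """Vertices reached by moving one step up on exactly one axis."""
    result = []
    for axis, values in enumerate(AXES):
        i = values.index(vertex[axis])
        if i + 1 < len(values):
            result.append(vertex[:axis] + (values[i + 1],) + vertex[axis + 1:])
    return result


def greedy_path(slowdown, start):
    """Climb from start, always taking the edge with the largest gain.

    slowdown maps each vertex to its percent slowdown relative to the top.
    Returns a list of (vertex, gain) steps. Stops at the top of the lattice
    or when no upward edge improves performance.
    """
    path = []
    current = start
    while True:
        candidates = upward_neighbors(current)
        if not candidates:
            return path
        best = min(candidates, key=lambda v: slowdown[v])
        gain = slowdown[current] - slowdown[best]
        if gain <= 0:
            return path
        path.append((best, gain))
        current = best


SLOWDOWN = {
    ("U", "ONE", "NF"): 16,    # text: WAM selection
    ("UA", "ONE", "NF"): 13,   # text: 3% gain from arithmetic tests
    ("UA", "ANY", "NF"): 10,   # text: another 3% from any-argument selection
    ("UA", "ANY", "F"): 2,     # text: 8% from factoring, within 2% of Aquarius
    ("UAT", "ANY", "F"): 0,    # text: Aquarius selection
    ("U", "ANY", "NF"): 14,    # HYPOTHETICAL placeholder
    ("U", "ONE", "F"): 15,     # HYPOTHETICAL placeholder
    ("U", "ANY", "F"): 12,     # HYPOTHETICAL placeholder
    ("UA", "ONE", "F"): 11,    # HYPOTHETICAL placeholder
    ("UAT", "ONE", "NF"): 12,  # HYPOTHETICAL placeholder
    ("UAT", "ONE", "F"): 10,   # HYPOTHETICAL placeholder
    ("UAT", "ANY", "NF"): 9,   # HYPOTHETICAL placeholder
}

if __name__ == "__main__":
    for vertex, gain in greedy_path(SLOWDOWN, ("U", "ONE", "NF")):
        print(vertex, gain)
```

With this table the first three steps reproduce the sequence described above (arithmetic tests, any argument, factoring), and a fourth step adds type checks to reach Aquarius selection.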

The resulting vertex (UA, ANY, F) seems to be a particularly good one, i.e. the ability to select on
arithmetic tests in any argument works well together with factoring. Leaving out any one of these three
extensions reduces performance by at least 8%. A plausible reason for this result is that the benchmarks do
many arithmetic tests on the arguments of compound terms and it is only the combination of the three
extensions that is able to compile this deterministically.

5. Prolog and C

The performance of Aquarius Prolog is significantly better than that of previous Prolog systems. A question
one can pose is how the system compares with an implementation of an imperative language. This section
presents a comparison of Prolog and the C language on several small programs. The comparison is not
exhaustive: there are so many factors involved that I do not attempt to address this issue in its entirety. I
intend only to dispel the notion that implementations of Prolog are inherently slow because of its expressive
power. A serious comparison of two languages requires answering the following questions:


(1) How can implementations of different languages be compared fairly? This comparison concentrates
exclusively on the language and ignores features external to the language itself, such as user interface,
development time, and debugging abilities. One method is to pick problems to be solved, and
then to write the “best” programs in each language to solve the problems, choosing the algorithms
appropriate for each language. The disadvantages of this approach are that (a) different languages are
appropriate for different problems, and (b) it is unclear how one decides when one has written the “best”
program. To avoid these problems I have chosen to compare algorithms, not programs.

(2) Which algorithms will be implemented in both languages? Ideally one should select a range of algorithms,
from those most suited to imperative computations (e.g. array computations) to those most
suited to symbolic computation (e.g. large dynamic data objects, pattern matching). Prolog is at an
advantage at the symbolic end of the spectrum because to implement symbolic computations in an
imperative language we effectively have to implement more and more of a Prolog-like system in that
language. The programmer does the work of a compiler. At the imperative end of the spectrum, the
efficiency of Prolog depends strongly on the ability of the compiler to simplify its general features.

(3) What programming style will be used in coding the algorithms? I have made an attempt to program
in a style which is acceptable for both languages. This includes choosing data types in both
languages that are natural for each language. For example, in Prolog dynamic data referenced by
pointers is easiest to express, whereas in C static arrays are easiest to express. It is possible to use
dynamic data in C, but it requires more effort and is used only for those tasks that need it specifically.

(4) How are architectural features taken into account? For fairness both implementations should run on
the same machine. The measurements use the same processor, the MIPS, for both implementations.
However, a general-purpose architecture favors the execution of imperative languages, since it has
been designed to execute such languages well. This shows up for algorithms whose Prolog implementation
makes heavy use of Prolog-specific features. To allow the reader to make an informed
judgment, the table does not correct for this fact. It is important to bear in mind that by adding additional
architectural features comprising 5% of the chip area to the VLSI-BAM (a pipelined processor
similar in many ways to the MIPS), the performance increases by 50% for programs that use
Prolog-specific features (compiled with the current version of the Aquarius compiler). Architectural
studies done by our research group suggest that these features could be added to a future MIPS processor.

Table 7.9 compares the execution time of small algorithms coded in both C and Prolog on a 25 MHz MIPS
processor. Measurements are given for tak, fib, and hanoi, which are recursion-intensive integer functions,
and for quicksort, which sorts a 50 element list 10000 times. Prolog and C source code is available by
anonymous ftp to arpa.berkeley.edu. In all cases the user time is measured with the Unix “time” utility.
The C versions are compiled with the standard MIPS C compiler using both no optimization and the optimization
level that produces the fastest code (usually level 4). The Prolog versions are compiled with
dataflow analysis and translated into MIPS assembly by a partial translator. The same algorithms were
encoded for both Prolog and C, in a natural style for each. The natural style in C is to use static data,
whereas in Prolog all data is allocated dynamically.
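The difference between the two natural styles can be sketched for quicksort, where the C version sorts an array of integers in place and the Prolog version builds a new sorted list. The Python code below is an illustration only and is not the benchmark source. Python lists stand in for both the C array and the Prolog list, and the choice of the first element as pivot is an assumption made for the example.

```python
def quicksort_in_place(a, lo=0, hi=None):
    """Array style (as in the C version): sort a[lo..hi] in place."""
    if hi is None:
        hi = len(a) - 1
    if lo >= hi:
        return
    pivot = a[lo]  # first element as pivot (an assumption of this sketch)
    i = lo
    for j in range(lo + 1, hi + 1):
        if a[j] < pivot:
            i += 1
            a[i], a[j] = a[j], a[i]
    a[lo], a[i] = a[i], a[lo]  # move the pivot to its final position
    quicksort_in_place(a, lo, i - 1)
    quicksort_in_place(a, i + 1, hi)


def quicksort_new_list(xs):
    """List style (as in the Prolog version): build a new sorted list."""
    if not xs:
        return []
    pivot, rest = xs[0], xs[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    return quicksort_new_list(smaller) + [pivot] + quicksort_new_list(larger)


if __name__ == "__main__":
    data = [27, 74, 17, 33, 94, 18, 46, 83, 65, 2]  # arbitrary sample data
    in_place = list(data)
    quicksort_in_place(in_place)      # overwrites its argument
    print(in_place)
    print(quicksort_new_list(data))   # leaves data untouched
```

The in-place version allocates nothing beyond its stack frames, while the list version allocates new cells at every level of recursion.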


Table 7.9 - Comparing Prolog and C (in sec)

Benchmark         Aquarius Prolog   MIPS C Unoptimized   MIPS C Optimized
----------------  ---------------   ------------------   ----------------
tak(24,16,8)      1.2               2.1                  1.6
fib(30)           1.5               2.0                  1.6
hanoi(20,1,2,3)   1.3               1.6                  13
quicksort         2.8               3.3                  1.4

Recursive functions are fast in Prolog for three reasons: last call optimization converts recursion into
iteration, environments (stack frames) are allocated per clause and not per procedure as in C, and outputs
are returned in registers (they are of uninitialized register type). Last call optimization allows functions
with a single recursive call to execute with constant stack space. This is essential for Prolog because recursion
is its only looping construct. The MIPS C compiler does not do last call optimization. C has constructs
to denote iteration explicitly (e.g. “for” and “while” loops) so it does not need this optimization as
strongly. The time for fib(30), the only recursive integer function that is not able to use last call
optimization in Prolog, is closest to C.
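What last call optimization achieves can be shown by hand in Python, which, like the MIPS C compiler described here, does not perform it. The sketch below is illustrative only. The function names are invented for the example, and the base cases fib(0) = fib(1) = 1 are an assumption, since the benchmark source is not reproduced in this chapter.

```python
def sum_to_recursive(n, acc=0):
    """Sum of 1..n with an accumulator: the recursive call is the last action."""
    if n == 0:
        return acc
    return sum_to_recursive(n - 1, acc + n)


def sum_to_loop(n, acc=0):
    """The same computation after last call optimization, written as a loop.

    The recursive call is replaced by updating the arguments and jumping
    back to the top, so stack use is constant.
    """
    while n != 0:
        n, acc = n - 1, acc + n
    return acc


def fib(n):
    """Doubly recursive Fibonacci. Work remains after the recursive calls
    (the addition), so neither call can be turned into a jump."""
    if n < 2:
        return 1  # assumed base cases: fib(0) = fib(1) = 1
    return fib(n - 1) + fib(n - 2)


if __name__ == "__main__":
    print(sum_to_recursive(100))  # one stack frame per level in Python
    print(sum_to_loop(100000))    # constant stack space
    print(fib(10))
```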

The two quicksort implementations are careful to use the same pivot elements. The C implementation
uses an array of integers and does in-place sorting. The Prolog implementation uses lists and creates a
new sorted list. The list representation needs two words to store each data element. Coincidentally, the
Prolog version is twice as slow as the C version, the same as the ratio of the data sizes.


Table 7.10 - Classification of bug types

Kind            Description                                                                       %
--------------  --------------------------------------------------------------------------------  ----
Mistake         A part of the compiler that is incorrect due to an oversight. When many           39
                mistakes occur related to one particular area, then they become hotspot bugs.

 • Local        A problem that can be fixed by changing just a few predicates. For example, it    (37)
                may be due to a typographical error or a simple oversight in a predicate
                definition.

 • Global       A problem that can be fixed only with many changes throughout the compiler.       (2)
                This kind of mistake is more fundamental. For example, avoiding the generation
                of BAM instructions with double indirections requires many small changes.

Incomplete      A part of the compiler whose first implementation is incomplete because of        19
                incomplete understanding of its purpose. Later use stretches it beyond what it
                was intended to do, so that it needs to be extended and/or cleaned up. For
                example, the updating of type formulas when new information is given.

Hotspot         A critical area of the compiler that requires much thinking to get correct. Its   16
                importance is much greater than its size would indicate. Such an area gets
                more than its share of mistakes.

 • Conceptual   A concept in the compiler design whose implementation is prone to many            (13)
                mistakes. For example, the concept of uninitialized variables.

 • Physical     A part of the compiler’s text. For example, symbolic unification in the           (14)
                dataflow analyzer and parameter passing in the clause compiler both resulted
                in many bugs.

Mixture         An undesired interaction between separate parts of the compiler. Despite          16
                careful design, often the separate transformations and optimizations are not
                completely orthogonal, but interact in some (usually limited) way. For example,
                maintaining consistency between the dataflow analyzer and the clause
                compiler. This leads to the dereference chain transformation, which in its turn
                leads to the problem of interaction between it and the preferred register
                allocation.

Improvement     A possible improvement in the compiler. This is not strictly a bug, but it may    9
                point to an important optimization that could be added to the compiler. For
                example, a possible code optimization or reduction in compilation time.

Understanding   A problem due to the programmer misunderstanding the required input to the        4
                compiler. This is not strictly a bug, but it may point to difficulties in the
                compiler's user interface or in the language. For example, the difference
                between the terms _is_ and _<_ in Prolog. The first is a variable and the
                second is a structure.

6.  Bug  analysis 


This section gives an overview of the number and types of bugs encountered during compiler
development. A bug in a program is a problem that leads to incorrect or undesired behavior of the program.
In the compiler, this means incorrect or slow compilation, or slow execution of compiled code.
Table 7.10 classifies the bugs found during development [76]. (The percentages do not add up to 100%
because bugs can be of more than one type.)

The development of the compiler started in early 1988 and proceeded until late 1990. An extensive
suite of test programs was maintained to validate versions of the compiler. The test suite was continually
extended with programs that resulted in bugs and with programs from external sources. Records were kept
of all bugs reported by users of the compiler other than the developer. A total of 79 bug reports were sent
from January 1989 to August 1990 by five users. The frequency of bug reports stayed constant near four
per month. Statistical analysis is consistent with the distribution being random with no time dependence,
i.e. the number of bug reports fluctuates, but there is no increasing or decreasing trend. Therefore the
development introduced bugs at about the same rate as they were reported and fixed. This coincidence can
be explained by postulating that the time spent developing was limited by the necessity of having to spend
time debugging to maintain a minimum level of robustness in the compiler. This is consistent with my personal
experience during the development process.
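The figure of about four reports per month follows directly from the totals given above. The Python lines below only check that arithmetic. The helper name is invented for the example, and counting both January 1989 and August 1990 as full months is an assumption.

```python
def months_inclusive(start_year, start_month, end_year, end_month):
    """Number of calendar months from start to end, counting both ends."""
    return (end_year - start_year) * 12 + (end_month - start_month) + 1


REPORTS = 79                                   # total bug reports (from the text)
MONTHS = months_inclusive(1989, 1, 1990, 8)    # January 1989 to August 1990
RATE = REPORTS / MONTHS                        # average reports per month

if __name__ == "__main__":
    print(MONTHS)
    print(RATE)
```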


Chapter  8 

Concluding  Remarks  and  Future  Work 


“So many things are possible just as long
as you don’t know they’re impossible.”

- Norton Juster, The Phantom Tollbooth

1.  Introduction 

In  this  chapter  I  recapitulate  the  main  result  of  this  dissertation,  distill  some  practical  lessons  learned 
in  the  design  process,  talk  about  the  caveats  of  language  design,  and  give  directions  for  future  research. 

2. Main result

My thesis is that logic programming can execute as fast as imperative programming. For this purpose
I have implemented a new optimizing Prolog compiler, the Aquarius compiler. The driving force in
the compiler is to specialize the general mechanisms of Prolog (i.e. the logical variable, unification,
dynamic typing, and backtracking) as much as possible. The main ideas in the compiler are: the development
of a new abstract machine that allows more optimization, a mechanism to generate efficient code for
deterministic predicates (converting backtracking to conditional branching), specialization of unification
(encoding each occurrence of unification in the simplest possible way), and the use of global dataflow
analysis to derive types.

The resulting system is significantly faster than previous implementations and is competitive with C
on programs for which dataflow analysis is able to do sufficiently well. It is about five times faster than
Quintus Prolog, a popular commercial implementation.

3.  Practical  lessons 

During the design of this compiler I have found four principles useful.

(1) Simplicity is common. Most of the time, only simple cases of the general mechanisms of the
language are used. For example, most uses of unification are memory loads and stores. Many of
these simple cases are easily detected at compile-time.

(2) Use the design time wisely. There are many possible optimizations that one can implement in a
compiler of this sort. To get the best results, rank them according to their estimated performance
gain relative to their implementation effort, and only implement the best ones. Do not be distracted
by clever ideas unless you can prove that they are effective.

(3) Keep the design simple. For each optimization or transformation, implement the simplest version
that will do the job. Do not attempt to implement a more general version unless it can be done
without any extra effort. It is easy to become entangled in the mechanics of implementing a complex
optimization. Often a simple version of this optimization achieves most of the benefits in a fraction
of the time.

(4) Document everything, including bugs. Documentation is an extension to one's memory and it pays
for itself quickly. The mental effort spent in writing down what one has done results in a better
recollection of what happened. In this design, I have maintained two logs. The first is a file in chronological
order that documents each change and the reason for it. The second is a directory containing
bug reports contributed by the users of the compiler and brief discussions of the fixes.

The first three of these principles are corollaries of what is sometimes called the “80-20 rule”: 80% of the
results are obtained with 20% of the effort. Using this principle consistently was very important for my
work and for the BAM project as a whole.
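The ranking advice of the second principle can be sketched as a small Python routine. This is an illustration only: the candidate names, gains, and efforts in the demonstration are hypothetical, and the units (percent speedup, weeks of work) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    estimated_gain: float    # e.g. expected percent speedup
    estimated_effort: float  # e.g. weeks of work, assumed to be positive


def rank(candidates):
    """Order candidates by estimated gain per unit of effort, best first."""
    return sorted(
        candidates,
        key=lambda c: c.estimated_gain / c.estimated_effort,
        reverse=True,
    )


def select(candidates, budget):
    """Take candidates in rank order while their effort fits in the budget."""
    chosen, spent = [], 0.0
    for c in rank(candidates):
        if spent + c.estimated_effort <= budget:
            chosen.append(c)
            spent += c.estimated_effort
    return chosen


if __name__ == "__main__":
    # Hypothetical candidates, not taken from the compiler's history.
    ideas = [
        Candidate("idea_a", 10.0, 2.0),
        Candidate("idea_b", 8.0, 4.0),
        Candidate("idea_c", 3.0, 1.0),
        Candidate("idea_d", 12.0, 12.0),
    ]
    print([c.name for c in rank(ideas)])
    print([c.name for c in select(ideas, budget=5.0)])
```

Note that the candidate with the largest absolute gain ranks last here because its estimated effort is disproportionately large.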

4.  Language  design 

The Prolog language is only an approximation to the ideal of logic programming. During this
research, our group has grappled with some of the deficiencies of Prolog. There are deficiencies in the area
of logic: Prolog's approximation to negation (i.e. negation-as-failure) is unsound (i.e. it gives incorrect
results) when used in the wrong way. Prolog's implementation of unification can go into infinite loops
when creating circular terms. The default control flow is too rigid for data-driven programming.

There are deficiencies in the area of programming: The correspondence between a program and its
execution efficiency is not always obvious. Unification is only able to access the surface of a complex data
structure. Because the clauses of a predicate are written separately, many conditions have to be repeated or
extra predicates have to be defined. There is a sense in which Prolog is a kind of assembly language.

All of the above problems have solutions, some of which have been implemented in existing systems
and in the Aquarius system. However, for three reasons I have resisted the impulse to change the language
more than just a little. First, of all logic languages, the Prolog language has the largest and most vigorous
user community, and this is a resource I wanted to tap. There are many programs written in Prolog, in various
styles, and I wanted to see if this existing pool of ingenuity could be made to run faster. Second, it is
unwise to change more than one component of a system at the same time, especially if they can interact in
unpredictable ways. That is, one should not design a new language and a compiler for it at the same time.
Third, I do not deem myself competent yet to design a language. I believe in the rule of bootstrapped competence:
Before writing a compiler, write programs. Before designing a language, write compilers. Competence
in each task is limited by competence in its prerequisite.

The best languages are those which distill great power in a small set of features. This makes such
languages useful as tools for thought as well as for implementation. Practical aspects such as how efficiently
a language can be implemented are as important in a good language design as theoretical aspects. A good language
is theoretically clean (i.e. easily understood) as well as being efficiently implementable. Examples of such
languages are Pascal (many algorithms are specified in a Pascal-like pseudo-code), Scheme, and Prolog.
To create such a language, a person must have completely digested a set of ideas as well as have a large
amount of practical experience. This is a difficult combination: it is easy to gloss over the areas one does
not know well.

5.  Future  work 

The goal of achieving parity with imperative languages has been achieved for the class of programs
for which dataflow analysis is able to provide sufficient information, and for which the determinism is
accessible through built-in predicates. To further improve performance these limits must be addressed.

To guide the removal of these limits it is important to build large applications and study the interaction
between programming style and the implementation. This is a problem of successive refinement. A
more sophisticated implementation catalyzes a new style of programming, which in its turn catalyzes a new
implementation, and so forth. The first step in this process was the development of the first Prolog compiler
and the WAM. The Aquarius system is only the second step. It is able to generate efficient code from
programs written in a more logical style than standard Prolog. However, the limits of this style are not yet
understood as they are in the WAM. Further work in this area will lead to a successor to Prolog that is
closer to logic and also efficiently implementable.

5.1. Dataflow analysis

When writing a program, a programmer commonly has a definite intention about the data's type
(intending predicates to be called only in certain ways) and about the data's lifetime (intending data to be
used only for a limited period). Because of this consistency, I postulate that a dataflow analyzer should be
able to derive this information and a compiler should be able to take advantage of it.

There has been much good theoretical work on global analysis for Prolog, but few implementations,
and fewer still that are part of a compiler that takes advantage of the information. Measurements of the
Aquarius system show that a simple dataflow analysis scheme integrated into a compiler is already quite
useful. However, the implementation has been restricted in several ways to make it practical. As programs
become larger, these restrictions limit the quality of the results. I hope the success of this experiment
encourages others to relax these restrictions. For example, it would not be too difficult to:

• Extend the domain to represent common types such as integers, proper lists, and nested compound
terms. This is especially important for general-purpose processors.

• Extend the domain to represent variable aliasing explicitly. This avoids the loss of information that
affects the analyzer.

• Extend the domain to represent data lifetimes. This is useful to replace copying of compound terms
by in-place destructive assignment. In this way dynamically allocated data becomes static. The term
“compile-time garbage collection” that has been used to describe this process is a misnomer: what
is desired is not just memory recovery, but to preserve as much as possible of the old value of the
compound term. Often a new compound term similar to the old one is created at the same time the
old one becomes inaccessible. Destructive assignment is used to modify only those parts that are
changed.

• Extend the domain to represent types for each invocation of a predicate. For example, the analyzer
could keep track not only of argument types for predicate definitions, but of argument types for goals
inside the definitions. This is useful to implement multiple specialization, i.e. to make separate
copies of a predicate called in several places with different types. For the chat_parser benchmark,
making a separate copy of the most-used predicate for each invocation results in a performance
improvement of 14%.
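The bookkeeping behind multiple specialization can be sketched in Python. This is a hypothetical illustration and not the analyzer's implementation. The function name, the copy-naming scheme, and the sample call sites are invented for the example; the type names merely echo the kinds of types the analyzer derives.

```python
def plan_specializations(call_sites):
    """Assign one specialized copy to each distinct (predicate, types) pair.

    call_sites is a list of (predicate_name, argument_types), where
    argument_types is a tuple of type names inferred at that call.
    Returns (copies, rewritten): copies maps each distinct pair to the name
    of its specialized copy, and rewritten gives, for each call site in
    order, the copy that the call should be redirected to.
    """
    copies = {}
    counters = {}
    rewritten = []
    for predicate, types in call_sites:
        key = (predicate, types)
        if key not in copies:
            counters[predicate] = counters.get(predicate, 0) + 1
            copies[key] = f"{predicate}_{counters[predicate]}"
        rewritten.append(copies[key])
    return copies, rewritten


if __name__ == "__main__":
    # Hypothetical call sites of one predicate with different inferred types.
    sites = [
        ("append", ("ground", "ground", "uninit")),
        ("append", ("nonvar", "ground", "uninit")),
        ("append", ("ground", "ground", "uninit")),
    ]
    copies, rewritten = plan_specializations(sites)
    print(len(copies))  # number of separate copies to compile
    print(rewritten)    # which copy each call site uses
```

Call sites with identical inferred types share a copy, so the code growth is bounded by the number of distinct type combinations rather than by the number of calls.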

5.2. Determinism

The second area in which significant improvement is possible is determinism extraction. The
Aquarius compiler only recognizes determinism in built-in predicates of three kinds (unification, arithmetic
tests, and type checking). Often this is not enough. In many programs, user-defined predicates are used to
choose a clause.

Appendix A

User  manual  for  the  Aquarius  Prolog  compiler 


1.  Introduction 

The Aquarius Prolog compiler reads clauses and directives from stdin and outputs Prolog-readable
compiled code to stdout as one fact per instruction. The output is assembly code for the Berkeley Abstract
Machine (BAM). Directives hold starting from the next predicate that is input. Clauses do not have to be
contiguous in the input stream; however, the whole stream is read before compilation starts.

This manual is organized into ten sections. Section 2 documents the compiler’s directives. Section 3
gives the compiler's options. Section 4 gives a short overview of the dataflow analysis done by the compiler.
Section 5 gives the type declarations accepted by the compiler. Section 6 summarizes the differences
between Aquarius Prolog and the Edinburgh standard. Section 7 gives an example showing how to
use the compiler. Section 8 describes the method used to compile specialized entry points to increase the
efficiency of built-ins. Section 9 describes the assembly language interface. Section 10 describes how to
define BAM assembly macros.

2.  Directives 

The  directives  recognized  by  the  Aquarius  compiler  are  given  in  Table  1. 

3.  Options 

The Aquarius compiler's options are given in three categories: high-level (these options control
actions of the compiler at the Prolog level), architecture-dependent (these options are constant for a particular
architecture), and low-level (mainly useful for debugging purposes). The default options are set for
the VLSI-BAM processor. The options are given in Tables 2, 3, and 4.

4. Dataflow analysis

Dataflow analysis is enabled with the analyze option. It generates ground, nonvar, recursively
dereferenced and uninitialized variable types which are merged with the programmer’s types. Both uninitialized
memory and uninitialized register types are generated. Entry declarations (given by entry
directives) are used to drive the analysis. Predicates of arity zero are always used as entry declarations.
The quality of the generated types is such that compilation time, execution time, and code size are all
significantly reduced. Therefore it is recommended always to compile with analysis. The whole program
is kept in memory during the analysis.

All mode, entry, and op directives are executed before the analysis starts. Other directives are
executed after the analysis and before compilation. The directives default and clear interfere with
dataflow analysis, so they should be given only when the analyze option is disabled.

4.1. Dataflow analysis and dynamic code

The compiler makes the distinction between static and dynamic code. Static code is completely
known at compile-time and is subject to analysis. Dynamic code is created at run-time by the built-in
predicates assert/1, retract/1, and their cousins. It is not analyzed. There are two cases to
consider:

(1) A dynamic predicate calls a static predicate. In this case, there must be an entry declaration giving
the worst-case type of the call for each static predicate that might be called by a dynamic predicate.
Leaving out this declaration may result in incorrect compilation.

(2) A static predicate calls a dynamic predicate. The analyzer will assume worst-case types for the
dynamic predicate unless it has a type declaration.

The most common uses of dynamic code are as databases of facts, or as rules that only call a limited set of
static predicates. For these uses, there is no problem in integrating analyzed static code with dynamic code.
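Case (1) can be sketched as follows. This is only an illustration: the predicate names lookup/2 and
table/2 are hypothetical, and using true as the worst-case formula is an assumption based on its
meaning "nothing is known about the type".

```prolog
% Enable dataflow analysis for the whole input stream.
:- option(analyze).

% lookup/2 is static, but clauses asserted at run time may call it with
% arbitrary arguments, so it gets an entry declaration with the weakest
% type formula (assumed here to be 'true').
:- entry((lookup(Key, Value) :- true)).

lookup(Key, Value) :- table(Key, Value).

% Hypothetical static facts used by lookup/2.
table(a, 1).
table(b, 2).
```

Without the entry declaration the analyzer could derive a stronger type for lookup/2 from its static
callers alone, which a call from asserted code might violate.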

4.2. Dataflow analysis and the call/1 built-in

The call/1 built-in predicate can call any predicate in the program with any modes, and it is not
possible in general to determine these predicates and their modes at compile-time. However, most
programs that use call/1 will call one of a known set of predicates or will call a dynamic predicate. There
are three cases to consider:

(1) If the set of predicates that may be arguments of call/1 is known by the programmer, then these
predicates should be given entry declarations with worst-case modes. (This case can be written more
efficiently by writing a new predicate that directly calls one of the set, and avoids calling call/1.)

(2) If the predicates that may be arguments of call/1 are dynamic, then analysis is correct without
entry declarations. This is true because dynamic predicates are not analyzed.

(3) If any predicate in the program may be an argument of call/1 and nothing is known about the
modes, then analysis is useless and it should not be done.
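The rewriting suggested in case (1) can be sketched in standard Prolog. The predicates double/2,
square/2, and apply_op/3 are hypothetical names chosen for the illustration; the point is only that
every possible callee appears as an ordinary static call.

```prolog
% Hypothetical predicates that were previously reached through call/1.
double(X, Y) :- Y is 2*X.
square(X, Y) :- Y is X*X.

% apply_op(+Name, +X, -Y): direct dispatch that replaces
%     Goal =.. [Name, X, Y], call(Goal)
% One clause per member of the known set of callees.
apply_op(double, X, Y) :- double(X, Y).
apply_op(square, X, Y) :- square(X, Y).
```

For example, the goal apply_op(square, 3, Y) binds Y to 9, and a name outside the known set simply fails.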

5. Types

The Aquarius compiler accepts type declarations for a predicate. Using types results in a significant
improvement in code quality. Types are represented as (Head:-Formula) where Head contains
only variables and Formula is a logical conjunction. Almost any Prolog test can be used in a type
formula. Possible type formulas are given in Table 5. This representation for types is simple, yet powerful
enough to represent much important information in a compact way. The representation generalizes the
declarations of DEC-10 Prolog. For example, the DEC-10 declaration:

    :- mode(concat(+,+,-)).

is expressed here as:

    :- mode((concat(A,B,C):-nonvar(A),nonvar(B),var(C))).
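The correspondence between the two notations can be made mechanical. The sketch below is an
illustration in standard Prolog, not part of the compiler: the predicate name dec10_to_aquarius/2 is
hypothetical, and only the two flags used in the example above (+ as nonvar, - as var) are handled.

```prolog
% dec10_to_aquarius(+Dec10Mode, -Type)
% Dec10Mode is a term such as concat('+','+','-'); Type is (Head :- Formula).
dec10_to_aquarius(Mode, (Head :- Formula)) :-
    functor(Mode, Name, Arity),
    functor(Head, Name, Arity),          % fresh head with distinct variables
    Mode =.. [_|Flags],
    Head =.. [_|Vars],
    flags_formula(Flags, Vars, Formula).

% Build a right-nested conjunction, one test per argument.
flags_formula([F], [V], G) :- !,
    flag_goal(F, V, G).
flags_formula([F|Fs], [V|Vs], (G, Gs)) :-
    flag_goal(F, V, G),
    flags_formula(Fs, Vs, Gs).

flag_goal('+', V, nonvar(V)).
flag_goal('-', V, var(V)).
```

Calling dec10_to_aquarius(concat('+','+','-'), T) yields a term of the same shape as the Aquarius
declaration shown above.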

6. Differences with Edinburgh Prolog

Aquarius Prolog recognizes new type-checking built-ins which are not part of the Edinburgh Prolog
standard as embodied by C-Prolog. The new built-ins and their definitions in standard Prolog are given in
Table 6.

7. An example of the compiler's use

The following example shows how the compiler is used:

    % /hprg2/Bam/Compiler/compiler        % Run the compiler.
                                          % Code is entered directly.
    :- mode((a(A):-nonvar(A))).           % Enter the type.
    a(a).                                 % Enter a single two-fact predicate.
    a(b).
    ^D                                    % End-of-file.

                                          % The output follows:
    % Cputime between start and finish is 1.383

    procedure(a/1).
    deref(r(0),r(0)).
    hash(atomic,r(0),2,l(a/1,1)).
    fail.
    label(l(a/1,1)).
    pragma(hash_length(2)).
    pair(a,l(a/1,3)).
    pair(b,l(a/1,4)).
    label(l(a/1,3)).
    label(l(a/1,4)).
    return.


8. Entry specialization for more efficient built-ins

The directive modal_entry(Head,EntryTree) adds a discrimination tree of entry points for
the predicate Head. This directive is used by the system to implement more efficient built-ins. It is not
normally needed by programmers, although they can take advantage of it for other predicates. The
compiler uses the discrimination tree to choose the most efficient entry point for each call of a predicate
depending on the type formula that is true at the predicate's call. The syntax of the discrimination tree in
modal_entry is:

    tree(entry(EntryHead)).
    tree(mode(Formula,TrueTree,FalseTree)) :-
        tree(TrueTree), tree(FalseTree).

EntryHead is the entry point that replaces Head and Formula is a type formula. Compilation of a
call to the predicate Head proceeds by following a path down the discrimination tree. If the formula valid
when Head is called implies Formula then the TrueTree is followed. Otherwise the FalseTree is
followed. Tree travel stops when an entry point entry(EntryHead) is encountered. At that point the
original call is replaced by EntryHead.
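The tree traversal just described can be sketched in standard Prolog. This is an illustration under
stated assumptions, not the compiler's code: select_entry/3, implied/2, and the example tree for a
predicate len/2 are hypothetical, and implication between formulas is approximated by syntactic
identity with a member of a list of known formulas.

```prolog
% select_entry(+Tree, +Known, -EntryHead)
% Known is a list of type formulas assumed valid at the call.
select_entry(entry(EntryHead), _, EntryHead).
select_entry(mode(Formula, TrueTree, FalseTree), Known, EntryHead) :-
    (   implied(Formula, Known)
    ->  select_entry(TrueTree, Known, EntryHead)
    ;   select_entry(FalseTree, Known, EntryHead)
    ).

% Crude stand-in for the implication test: Formula must be
% identical to one of the known formulas.
implied(Formula, [K|_]) :- Formula == K, !.
implied(Formula, [_|Ks]) :- implied(Formula, Ks).

% Hypothetical discrimination tree for a predicate len(L, N).
example_tree(L, N,
    mode(list(L),
         entry(len_list(L, N)),
         entry(len_general(L, N)))).
```

With Known = [list(L)] the call is replaced by len_list(L, N); with an empty list of known formulas the
FalseTree is followed and len_general(L, N) is chosen.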

9. Interfacing with BAM assembly language routines

Prolog predicates can efficiently call routines written in BAM assembly code (the compiler's output)
or in the target machine's assembly language (for example, VLSI-BAM, MIPS, or MC68020 assembly
code). The interface with both low-level languages is provided through the five-argument type declaration.
This declaration has the following form:

    :- mode(Head, Require, Before, After, Survive).

Head is the head of the predicate. Require is the required type formula, i.e. the formula made true by
the compiler. All uninitialized variable types (both uninitialized memory and uninitialized register) must
be part of the required formula. Before is the type formula known to be valid before the call.
After is the type formula known to be valid after the call. Survive is the register survive flag. If the
flag is y then the predicate must not alter the values of any argument registers (except those used to return
a result). It must save and restore any argument registers it needs. The predicate is called with a
simple_call instruction and must return with a simple_return instruction (or its equivalent in
VLSI-BAM processor assembly). A simple call may not be nested. It is more efficient than a standard call
because it does not need an environment frame around it in the calling routine.

If the survive flag is n then the predicate is assumed to invalidate all argument register values. In
this case the argument registers are available as scratch registers and the calling routine will create an
environment frame.

Efficient parameter passing is implemented by using uninitialized variables. These are of two kinds:
uninitialized memory and uninitialized register variables. An uninitialized memory variable is a pointer to
an empty memory cell. Binding to it is a store to memory. An uninitialized register variable is an empty
register. Binding to it is a move to the register. No trailing or dereferencing is needed in either case.

Declaring an argument to have an uninitialized register type means that the output of the routine is
stored in the corresponding argument register. Similarly, an uninitialized memory type requires the output
to be stored to the location pointed to by the argument register. Inputs and outputs must be put in separate
registers.
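As an illustration, a declaration for an assembly routine with one input and one register output
might look as follows. The routine name incr/2 and its type formulas are hypothetical; only the shape
of the five-argument declaration is taken from the text.

```prolog
% Hypothetical assembly routine incr(A, B): A is the input, B returns A+1.
%   Require: B is an uninitialized register variable (made true by the compiler).
%   Before:  A is known to be an integer before the call.
%   After:   B is known to be an integer after the call.
%   Survive: y, so the routine must preserve argument registers other than B
%            and is entered with simple_call.
:- mode(incr(A, B), uninit_reg(B), integer(A), integer(B), y).
```

Because B is declared uninit_reg, the routine is expected to leave its result in the second argument
register, and the input A stays in a separate register.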

10. Defining BAM assembly language macros

It is possible to define macros in the Prolog source that are expanded into BAM assembly
instructions. The advantages of macros are that they do not have call-return overhead, that unnecessary
shuffling of data between registers is avoided, and that the full range of low-level compiler optimizations
is performed on them. A macro definition has the following form:

    :- macro((Head :- Body)).

where Head is the head of the predicate that will be expanded and Body is a series of BAM
instructions. For example:

    :- mode(quad(A,B), uninit_reg(B), true, deref(B), y).
    :- macro((quad(A,B) :- add(A,A,X), add(X,X,B))).

The macro definition is preceded by a mode declaration telling that the second argument is the output.
Macro definitions must obey the following rules:

(1) All legal BAM instructions and addressing modes are allowed in the macro definition including user
instructions, except as noted below. User instructions are never generated by the compiler, but they
are recognized and optimized in macro definitions. Labels are given as ground terms or as Prolog
variables. The latter are given unique ground values by the compiler. Registers are given as user
registers (e.g. r(h) and r(t2)) or as Prolog variables (e.g. X and Y). The latter are allocated
by the compiler. Do not use numbered registers (r(0), r(1), ...).

(2) The macro definition must be preceded by a mode declaration. The exit modes must be valid upon
exiting the macro. All head arguments that return results must be of uninitialized register type.

(3) The macro may not alter any of the head arguments except those returning a result.

(4) The second argument of the deref(X,Y) instruction must be a new variable, i.e. it must not have
a value upon entering the macro. Failing to obey this constraint will lead to incorrect behavior on
backtracking.

(5) It is not recommended to create choice points inside macros since it is not known how many registers
are live.


Directive                       Action

:- help.                        Print a summary of these directives.

:- default.                     Set the default options for the VLSI-BAM processor and clear all
                                type declarations and modal entries.

[directive name lost            Ensure compatibility with the MIPS processor. This directive
 in source]                     should occur only once in a file. It sets the option align(1),
                                disables the option split_integer, and sets all other options to
                                their default values. It clears all type declarations and modal
                                entries.

:- vlsi_plm.                    Ensure compatibility with the re-microcoded VLSI-PLM. This
                                directive should occur only once in a file. It sets the options
                                high_reg(6) and align(1), disables the option split_integer, and
                                sets all other options to their default values. It clears all type
                                declarations and modal entries. Trail checks and shifts are
                                compiled differently.

:- clear.                       Clear all type declarations and modal entries.

:- option(Options).             Add the options in Options to the current options. Options
                                may be a single option or a list of options.

:- notoption(Options).          Remove the options in Options from the current options.
                                Options may be a single option or a list of options.

:- printoption.                 Print a list of the currently active options.

:- mode((Head:-Formula)).       Type declaration for a predicate. The type information is
                                remembered until new types are given for that predicate or until
                                all type information is cleared. This declaration is not used as a
                                starting point for dataflow analysis. However, the types
                                generated by dataflow analysis are used to supplement the
                                declaration, and an error message is given if there is a
                                contradiction.

:- entry((Head:-Formula)).      Type declaration for a predicate, same as above. This
                                declaration is also used as a starting point in dataflow analysis.

:- mode(H,R,B,A,S).             Detailed type declaration for a predicate. This declaration is
                                useful for interfacing with assembly language. H is the head, R is
                                the required type formula (made true by the compiler before each
                                call), B is the before type formula (assumed true before each
                                call), A is the after type formula (assumed true after each call),
                                S is the survive flag (y/n depending on whether the call lets
                                registers survive). The after type formula is used by dataflow
                                analysis to improve the generated types.

:- entry(H,R,B,A,S).            Detailed type declaration for a predicate, same as above. This
                                declaration is also used as a starting point in dataflow analysis.

:- modal_entry(H,T).            Optional discrimination tree of efficient entry points for the
                                predicate H. The tree T contains type formulas used to replace
                                each call of the predicate by a more efficient entry point.

:- macro((Head:-Body)).         Macro definition. The head is expanded into a sequence of BAM
                                assembly instructions.

:- include(FileName).           Insert the text of the file FileName. This directive may be
                                nested up to the system limit of simultaneous open files.

:- pass(Anything).              Pass the input ":- pass(Anything)." unaltered to the output in
                                Prolog-readable form.

:- version.                     Print the creation date of this version of the compiler.

:- op(A,B,C).                   Declare an operator in Prolog. Pass the input ":- op(A,B,C)."
                                unaltered to the output in Prolog-readable form.


Table 2 - High-level compiler options

Option                   Default   Description

select_limit(L)          L=1       Perform selection for up to L arguments. Selection is done
                                   according to the enrichment heuristic. See Chapter 4 section
                                   6.2.

analyze                  off       Perform dataflow analysis for all predicates in the input
                                   stream. This option enables analysis of the entire input
                                   stream, no matter where it occurs in the stream. The starting
                                   points for analysis are the entry declarations and all
                                   predicates of arity zero. The types obtained from the analysis
                                   are merged with the programmer's types. The predicates are
                                   then compiled with the merged types.

compile                  on        Compile the input. When this option is disabled, the entry
                                   types generated by the dataflow analyzer for the source
                                   predicates are output as valid Prolog-readable type
                                   declarations.

factor                   on        Do factoring source transformation. With this transformation
                                   similar compound terms in adjacent heads are only unified
                                   once. Often this gives faster code.

comment                  on        Give information about what the compiler is doing.

same_number_solutions    on        Keep the same number of solutions on backtracking as
                                   standard Prolog. Relaxing the semantics by removing this
                                   option results in better code in some cases.

same_order_solutions     on        Keep the same order of solutions on backtracking as standard
                                   Prolog. Relaxing the semantics by removing this option
                                   results in better code in some cases.

depth_limit(D)           D=2       Nesting depth limit on unification goals. Unifications deeper
                                   than this limit are transformed to remain within this limit.
                                   This transformation is used because compilation time and
                                   code size for deeply nested unifications would otherwise
                                   increase as the square of the size of the unification.

short_block(S)           S=6       Threshold on basic block length for shuffle optimization.

Table 3 - Architecture-dependent compiler options

Option                   Default   Description

low_reg(L)               L=0       Lowest numbered machine register.

high_reg(H)              H=100     Highest numbered machine register. In the VLSI-BAM
                                   processor, registers higher than r(15) are mapped into
                                   memory.

low_perm(P)              P=0       Lowest numbered permanent variable.

hash_size(H)             H=5       Minimum size of a hash table.

align(K)                 K=2       Align all compound terms to start on a multiple of K.

uni                      on        Generate unify_atomic instruction to unify with an atomic
                                   term.

split_integer            on        Use separate tags for negative and nonnegative integers.

Option                   Default   Description

system(X)                quintus   The system running the compiler (other value: cprolog).

write                    on        Write the object code when compilation is complete.

peep                     on        Do peephole optimization.

stats(S)                 off       Print timing statistics during compilation. S is one of the
                                   following atoms, or a list of them: t (top level of
                                   compilation), c (compilation of a single procedure),
                                   p (peephole optimization), s (selection algorithm: extraction
                                   of determinism), d (deterministic code generation).

debug                    off       Print debugging messages during compilation.
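The options in these tables are switched with the option and notoption directives. A minimal sketch,
assuming the directives are written at the top of the source file and that a list argument is accepted
as the directive descriptions state:

```prolog
% Turn on dataflow analysis and debugging messages (both are off by default).
:- option([analyze, debug]).

% Silence the progress messages (the comment option is on by default).
:- notoption(comment).

% Show which options are now active.
:- printoption.
```

How parameterized options such as depth_limit(D) interact with a previously set value is not stated
in the tables, so they are left out of this sketch.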

Table 5 - Type formulas

Type               Meaning

nonvar(A)          A is a nonvariable term, i.e. its main functor is instantiated. Nothing is
                   implied about its arguments.
ground(A)          A is a ground term, i.e. it contains no unbound variables.
var(A)             A is an unbound variable.
uninit(A)          A is an uninitialized memory variable. At the Prolog level, this means that A
                   is an unbound variable known not to be aliased to another variable. In the
                   implementation, A is a pointer to an empty memory cell. Binding to this
                   variable is a simple store, without dereferencing or trailing.
uninit_reg(A)      A is an uninitialized register variable. At the Prolog level, this has the
                   same meaning as an uninitialized memory variable. In the implementation, A is
                   an empty machine register. This type increases the efficiency of parameter
                   passing by returning a value directly in a register. It is useful for
                   interfacing with assembly language.
deref(A)           A is dereferenced.
rderef(A)          A is recursively dereferenced, i.e. A is dereferenced and all subterms of A
                   are recursively dereferenced.

structure(A)       A is a structure.
list(A)            A is a list, i.e. a cons cell or nil.
cons(A)            A is a cons cell, i.e. a non-nil list.
compound(A)        A is a structure or a cons cell.
functor(A,F,N)     A is the structure F with arity N.

atom(A)            A is an atom.
atomic(A)          A is atomic, i.e. a number or an atom.
simple(A)          A is atomic or an unbound variable.
integer(A)         A is an integer.
float(A)           A is a floating point number.
number(A)          A is an integer or a float.
negative(A)        A is a negative integer.
nonnegative(A)     A is a nonnegative integer.
A>0                A is a positive integer.
A==x               A is the atom x.

true               Nothing is known about the type.
fail               This means "execution can never reach this point."
(F1,F2)            This means "F1 and F2," where F1 and F2 are type formulas.
(F1;F2)            This means "F1 or F2," where F1 and F2 are type formulas.
not(F)             This means "not F," where F is a type formula.

Table 6 - New type-checking predicates in Aquarius Prolog

Predicate             Prolog Definition

nil(A)                :- nonvar(A), A==[].
cons(A)               :- nonvar(A), A=[_|_].
list(A)               :- nonvar(A), (A=[] ; A=[_|_]).
compound(A)           :- nonvar(A), \+atomic(A).
structure(A)          :- nonvar(A), \+atomic(A), \+A=[_|_].
ground(A)             :- nonvar(A), functor(A, _, N), ground(N, A).
simple(A)             :- (var(A) ; atomic(A)).
negative(A)           :- integer(A), A<0.
nonnegative(A)        :- integer(A), A>=0.
is_list(A)            :- (var(A), ! ; A=[] ; A=[_|B], is_list(B)).
is_partial_list(A)    :- (var(A), ! ; A=[_|B], is_partial_list(B)).
is_proper_list(A)     :- (var(A), !, fail ; A=[] ; A=[_|B], is_proper_list(B)).

The following clauses are part of the definition:

    ground(N, _) :- N=:=0.
    ground(N, A) :- N=\=0, arg(N, A, X), ground(X), N1 is N-1, ground(N1, A).
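Several of the Table 6 definitions can be tried directly in a standard Prolog system. In the sketch
below the predicates carry a hypothetical aq_ prefix so that they do not clash with built-ins of the
same name in the host system; the bodies follow the table.

```prolog
aq_nil(A)       :- nonvar(A), A == [].
aq_cons(A)      :- nonvar(A), A = [_|_].
aq_list(A)      :- nonvar(A), ( A = [] ; A = [_|_] ).
aq_structure(A) :- nonvar(A), \+ atomic(A), \+ A = [_|_].

% Succeeds only for lists that end in [].
aq_is_proper_list(A) :-
    (   var(A), !, fail
    ;   A = []
    ;   A = [_|B], aq_is_proper_list(B)
    ).

% Succeeds only for lists that end in an unbound variable.
aq_is_partial_list(A) :-
    (   var(A), !
    ;   A = [_|B], aq_is_partial_list(B)
    ).
```

The two list checks differ only in how they treat the tail: [a,b] is a proper list but not a partial
list, while [a|T] with T unbound is a partial list but not a proper list.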
Appendix B

Formal specification of the Berkeley Abstract Machine syntax


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Formal specification of the Berkeley Abstract Machine (BAM) syntax
% Copyright (C) 1989 Peter Van Roy and Regents of the University of California
% May be used and modified for non-commercial purposes if this notice is kept.
% Written by Peter Van Roy.

% This file is an executable Prolog program that checks the syntactic
% correctness of BAM instructions. The predicate instr(I) is true if I is
% a legal BAM instruction. In addition to instructions output by the Aquarius
% compiler, this predicate also accepts the user instructions of the BAM,
% which allow the run-time system to be written completely in BAM assembly.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% *** Check correctness of a sequence of BAM instructions ***

% Create saved state:
% Note: In C-Prolog this must be started up in a system
% equal to in size or larger than the one which created it.
main :- save(check, 1), prompt(_, ''), read(Instr), pipe(Instr, 0, 0), halt.
main :- halt.

% Pipe working loop:
pipe(end_of_file, M, N) :- !,
    T is M+N,
    write('*** Checked '), write(T), write(' instructions; '),
    write(M), write(' correct and '), write(N), write(' incorrect.'), nl.
pipe(Instr, M, N) :-
    (instr(Instr)
     -> M1 is M+1, N1=N
     ;  M1=M, N1 is N+1,
        write('*** Incorrect: '), write(Instr), nl
    ),
    !, read(NewInstr), pipe(NewInstr, M1, N1).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


% *** BAM Instructions ***

% 1. Unification support instructions:
instr(deref(V,W))          :- var_i(V), var_i(W).
instr(equal(EA,A,L))       :- ea_e(EA), arg_i(A), lbl(L).
instr(unify(V,W,F,G,L))    :- var_i(V), var_i(W), nv_flag(F), nv_flag(G), lbl(L).
instr(trail(V))            :- var_i(V).
instr(move(EA,VI))         :- ea_m(EA), var_i(VI).
instr(push(EA,R,N))        :- ea_p(EA), hreg(R), pos(N).
instr(adda(R,S,T))         :- numreg(R), numreg(S), hreg(T).
instr(pad(N))              :- pos(N).
instr(unify_atomic(V,I,L)) :- var_i(V), an_atomic(I), lbl(L).
instr(fail).

% 2. Conditional control flow instructions:
instr(switch(T,V,A,B,C)) :- a_tag(T), var_i(V), lbl(A), lbl(B), lbl(C).
instr(choice(I/N,Rs,L))  :- pos(I), pos(N), I=<N, lbl(L), regs(Rs).
instr(test(Eq,T,V,L))    :- eq_ne(Eq), var_i(V), a_tag(T), lbl(L).
instr(jump(C,A,B,L))     :- cond(C), numarg_i(A), numarg_i(B), lbl(L).
instr(move(CH,V))        :- a_var(V), choice_ptr(CH).
instr(cut(V))            :- a_var(V).
instr(hash(T,R,N,L))     :- hash_type(T), reg(R), pos(N), lbl(L).
instr(pair(E,L))         :- an_atomic(E), lbl(L).

% 3. Arithmetic instructions:
instr(add(A,B,V)) :- numarg_i(A), numarg_i(B), a_var(V).
instr(sub(A,B,V)) :- numarg_i(A), numarg_i(B), a_var(V).
instr(mul(A,B,V)) :- numarg_i(A), numarg_i(B), a_var(V).
instr(div(A,B,V)) :- numarg_i(A), numarg_i(B), a_var(V).
instr(mod(A,B,V)) :- numarg_i(A), numarg_i(B), a_var(V).
instr(and(A,B,V)) :- numarg_i(A), numarg_i(B), a_var(V).
instr( or(A,B,V)) :- numarg_i(A), numarg_i(B), a_var(V).
instr(xor(A,B,V)) :- numarg_i(A), numarg_i(B), a_var(V).
instr(not(A,V))   :- numarg_i(A), a_var(V).
instr(sll(A,B,V)) :- numarg_i(A), numarg_i(B), a_var(V).
instr(sra(A,B,V)) :- numarg_i(A), numarg_i(B), a_var(V).
instr(sll).  /* vlsi_plm only */
instr(sra).  /* vlsi_plm only */


% 4. Procedural instructions:
instr(procedure(N/A))    :- atom(N), natural(A).
instr(call(N/A))         :- atom(N), natural(A).
instr(return).
instr(simple_call(N/A))  :- atom(N), natural(A).
instr(simple_return).
instr(label(L))          :- lbl(L).
instr(jump(L))           :- lbl(L).
instr(allocate(Perms))   :- natural(Perms).
instr(deallocate(Perms)) :- natural(Perms).
instr(nop).

% 5. Pragma information for translator and reorderer:
instr(pragma(P)) :- pragma(P).

% 6. Additions to BAM for the assembly language programmer:
instr(I) :- user_instr(I).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

% Additions to BAM for the assembly language programmer

% This section describes the parts of the BAM language that are never output
% by the compiler, but only used by the BAM assembly programmer. This is used
% to write the run-time system in BAM code, so that it is as portable as
% possible. Additional instructions are jump to register address, convert
% tagged atom or integer to untagged integer (ord), its inverse (val), and
% non-trapping full-word unsigned comparison, non-trapping full-word
% arithmetic, and trailing for backtrackable destructive assignment.

user_instr(jump_reg(R))      :- reg(R).
user_instr(jump_nt(C,A,B,L)) :- cond(C), numarg_i(A), numarg_i(B), lbl(L).
user_instr(ord(A,B))         :- arg(A), a_var(B).
user_instr(val(T,A,V))       :- a_tag(T), numarg_i(A), a_var(V).
user_instr(add_nt(A,B,V))    :- numarg_i(A), numarg_i(B), a_var(V).
user_instr(sub_nt(A,B,V))    :- numarg_i(A), numarg_i(B), a_var(V).
user_instr(and_nt(A,B,V))    :- numarg_i(A), numarg_i(B), a_var(V).
user_instr( or_nt(A,B,V))    :- numarg_i(A), numarg_i(B), a_var(V).
user_instr(xor_nt(A,B,V))    :- numarg_i(A), numarg_i(B), a_var(V).
user_instr(not_nt(A,V))      :- numarg_i(A), a_var(V).
user_instr(sll_nt(A,B,V))    :- numarg_i(A), numarg_i(B), a_var(V).
user_instr(sra_nt(A,B,V))    :- numarg_i(A), numarg_i(B), a_var(V).
user_instr(trail_bda(X))     :- a_var(X).

% Additional registers:
% See Implementation Manual for list of existing registers.
user_reg(r(A)) :- atom(A).


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 


% *** Pragmas ***

% A variable is a multiple of N.
% Inserted just before loads in readmode unification.
pragma(align(V,N)) :- a_var(V), pos(N).

% Inserted just before a sequence of pushes in writemode unification.
% (The pushes may be interleaved with non-memory moves.)
pragma(push(term(Size))) :- pos(Size).
pragma(push(cons)).
pragma(push(structure(A))) :- pos(A).
pragma(push(variable)).

% Specify the tag of a variable.
% (This is useful for processors without explicit tag support.)
pragma(tag(V,T)) :- a_var(V), a_tag(T).

% Length of a hash table.
pragma(hash_length(Len)) :- pos(Len).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 


% *** Tags ***

a_tag(tatm).  /* atom */
a_tag(tint).  /* integer */
a_tag(tneg).  /* negative integer */
a_tag(tpos).  /* nonnegative integer */
a_tag(tstr).  /* structure */
a_tag(tlst).  /* cons cell */
a_tag(tvar).  /* variable */

atom_tag(tatm).

pointer_tag(tstr).
pointer_tag(tlst).
pointer_tag(tvar).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

% *** Addressing modes ***

heap_ptr(r(h)).
choice_ptr(r(b)).

reg(r(I)) :- int(I).
reg(U) :- user_reg(U).

hreg(R) :- reg(R).
hreg(R) :- heap_ptr(R).

perm(p(I)) :- natural(I).

an_atomic(I) :- int(I).
an_atomic(T^A) :- atom(A), atom_tag(T).
an_atomic(T^(F/N)) :- atom(F), pos(N), atom_tag(T).

a_var(Reg) :- reg(Reg).
a_var(Perm) :- perm(Perm).

arg(Arg) :- a_var(Arg).
arg(Arg) :- an_atomic(Arg).

var_i(Var) :- a_var(Var).
var_i([Var]) :- a_var(Var).

arg_i(Arg) :- var_i(Arg).
arg_i(Arg) :- an_atomic(Arg).

numreg(Arg) :- reg(Arg).
numreg(Arg) :- int(Arg).

numarg_i(Arg) :- var_i(Arg).
numarg_i(Arg) :- int(Arg).

var_off([Var]) :- a_var(Var).
var_off([Var+I]) :- a_var(Var), pos(I).

% Effective address for equal:
ea_e(Var) :- a_var(Var).
ea_e(VarOff) :- var_off(VarOff).

% Effective address for move:
ea_m(Arg) :- arg(Arg).
ea_m(VarOff) :- var_off(VarOff).
ea_m(Tag^H) :- pointer_tag(Tag), heap_ptr(H).

% Effective address for push:
ea_p(Arg) :- arg_i(Arg).
ea_p(Tag^H) :- pointer_tag(Tag), heap_ptr(H).
ea_p(Tag^(H+D)) :- pointer_tag(Tag), pos(D), heap_ptr(H).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

% *** Miscellaneous ***

eq_ne(eq).  /* Equal */
eq_ne(ne).  /* Not equal */

cond(lts).  /* Signed less than */
cond(les).  /* Signed less than or equal */
cond(gts).  /* Signed greater than */
cond(ges).  /* Signed greater than or equal */
cond(eq).   /* Equal */
cond(ne).   /* Not equal */

hash_type(atomic).
hash_type(structure).

lbl(fail).
lbl(N/A) :- atom(N), natural(A).
lbl(l(N/A,I)) :- atom(N), natural(A), natural(I).

nv_flag(nonvar).
nv_flag(var).
nv_flag('?').

% A list of register numbers:
% (May contain the value 'no' as well)
regs([]).
regs([R|Set]) :- (int(R) ; R=no), regs(Set).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% *** Utilities ***

ground(X) :- nonvar(X), functor(X, _, N), ground(N, X).

ground(N, _) :- N=:=0.
ground(N, X) :- N=\=0, arg(N, X, A), ground(A), N1 is N-1, ground(N1, X).

int(N) :- integer(N).
natural(N) :- integer(N), N>=0.
pos(N) :- integer(N), N>0.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 
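As a usage illustration, the fragment below extracts just enough of the specification to check a few
instructions of the kind shown in the compiler output of section 7. It is a reduced sketch, not the
full checker: a_var/1 is simplified to numbered registers and permanent variables (user registers are
omitted), and check_all/1 is a hypothetical helper that is not part of the specification.

```prolog
% Subset of the instruction checker.
instr(procedure(N/A)) :- atom(N), natural(A).
instr(deref(V, W))    :- var_i(V), var_i(W).
instr(label(L))       :- lbl(L).
instr(return).

var_i(Var)   :- a_var(Var).
var_i([Var]) :- a_var(Var).

% Simplified: numbered registers and permanent variables only.
a_var(r(I)) :- integer(I).
a_var(p(I)) :- natural(I).

lbl(fail).
lbl(N/A)       :- atom(N), natural(A).
lbl(l(N/A, I)) :- atom(N), natural(A), natural(I).

natural(N) :- integer(N), N >= 0.

% check_all(+Instrs): every element is a legal instruction.
check_all([]).
check_all([I|Is]) :- instr(I), check_all(Is).
```

With these clauses, check_all([procedure(a/1), deref(r(0),r(0)), label(l(a/1,1)), return]) succeeds,
while instr(deref(r(0), foo)) fails because foo is not a variable reference.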
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Appendix  C 

Formal  specification  of  the  Berkeley  Abstract  Machine  semantics 


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

%  Formal  specification  of  the  Berkeley  Abstract  Machine  (BAM)  semantics 
%  Copyright  (C)  1990  Peter  Van  Roy  and  Regents  of  the  University  of  California 
%  May  be  used  and  modified  for  non-commercial  purposes  if  this  notice  is  kept. 

%  Written  by  Peter  Van  Roy, 

%  The  specification  is  a  Prolog  prog.ram  that  defines  the  meaning  of  BAM  in 
%  terms  of  its  execution  in  a  simple  memory  model.  It  runs  BAM  code  directly 
%  from  the  output  of  the  Aquarius  con^iler. 

%  The  specification  does  not  include  the  user  instructions  of  the  BAM  since 
%  their  behavior  depends  on  the  target  machine. 

%  The  specification  is  written  in  the  Extended  DCG  notation. 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

% Meaning of registers:
% r(b)       Index to most recent choice point.
% r(e)       Index to current environment.
% r(tr)      Top of trail stack.
% r(h)       Top of heap stack.
% r(hb)      Value of r(h) at last choice point creation.
% r(pc)      Code address.
% r(cp)      Continuation pointer for code.
% r(tmp_cp)  Temporary continuation pointer for code, used only in simple_call.
% r(retry)   Retry address for backtracking, only exists inside choice points.
% r(I)       Argument and temporary register.
% p(I)       Location on current environment.

% Types stored in registers:
% r(e)          Contains values of registers {r(e),r(cp)} U {p(0), ..., p(N-1)},
%               where N is the size of the environment.
% r(b)          Contains values of registers {r(e),r(cp),r(tr),r(b),r(h),r(retry)} U RS,
%               where RS is a subset of {r(0), r(1), ...}.
% r(tr)         Contains a natural number.
% r(h), r(hb)   Contain words with a pointer tag.
% r(pc), r(cp)  Contain natural numbers or symbolic labels.
% r(tmp_cp)     Contains a symbolic label.
% r(retry)      Contains a symbolic label.
% r(I)          Contains a word.
% p(I)          Contains a word.

%  Comments: 

% A word is either an integer or a structure of the form Tag^Value where Value
% is a natural number except if Tag=tatm, in which case Value is an atom or a
% structure (F/N) where F is an atom and N is a natural number.

% A symbolic label is either the atom 'fail', or the structure F/N, or the
% structure l(F/N,I), where F is an atom and N and I are natural numbers.

% r(cp) is stored in environments, allowing nested calls.
% r(tmp_cp) is not stored in environments, allowing only one level of call.
% However, no environment is needed in a predicate containing a simple_call.

% There are no explicit stacks for environments or choice points; registers
% r(e) and r(b) each contain a set of register values.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 


% Accumulator declarations:

% Accumulators:
acc_info(code,   T, In, Out, table_command(T,In,Out)).
acc_info(lblmap, T, In, Out, table_command(T,In,Out)).
acc_info(regs,   T, In, Out, table_command(T,In,Out)).
acc_info(trail,  T, In, Out, table_command(T,In,Out)).
acc_info(heap,   T, In, Out, heap_table_command(T,In,Out)).
acc_info(count,  T, In, Out, (Out is T+In)).

% Predicate declarations:

% Top level:
pred_info(execute,        0, [regs,heap,trail,code,lblmap,count]).
pred_info(instr_loop,     0, [regs,heap,trail,code,lblmap,count]).
pred_info(instr_loop_end, 1, [regs,heap,trail,code,lblmap,count]).
pred_info(instr,          1, [regs,heap,trail,code,lblmap]).
pred_info(numeric_pc,     2, [lblmap]).

% Addressing modes:
pred_info(heap,    3, [     heap]).
pred_info(reg,     3, [regs     ]).
pred_info(perm,    3, [regs     ]).
pred_info(a_var,   3, [regs     ]).
pred_info(var_i,   3, [regs,heap]).
pred_info(arg,     2, [regs     ]).
pred_info(arg_i,   2, [regs,heap]).
pred_info(numreg,  2, [regs     ]).
pred_info(numarg,  2, [regs,heap]).
pred_info(var_off, 2, [regs,heap]).
pred_info(imm_tag, 2, [regs     ]).
pred_info(ea_e,    2, [regs,heap]).
pred_info(ea_m,    2, [regs,heap]).
pred_info(ea_p,    2, [regs,heap]).

% Instruction utilities:
pred_info(deref_rtn,           2, [regs,heap,trail]).
pred_info(deref_rtn_cont,      3, [regs,heap,trail]).
pred_info(equal_rtn,           3, [regs,heap,trail]).
pred_info(switch_rtn,          5, [regs,heap,trail]).
pred_info(test_rtn,            4, [regs,heap,trail]).
pred_info(jump_cond_rtn,       4, [regs,heap,trail]).
pred_info(hash_lookup,         3, [regs,heap,trail,lblmap,code]).
pred_info(hash_lookup_2,       3, [regs,heap,trail,lblmap,code]).
pred_info(hash_indirect,       3, [     heap      ]).
pred_info(save_choice_regs,    2, [regs,heap,trail]).
pred_info(restore_choice_regs, 2, [regs,heap,trail]).
pred_info(detrail_rtn,         2, [regs,heap,trail]).
pred_info(trail_rtn,           1, [regs,heap,trail]).
pred_info(cmp_trail,           2, [regs,heap,trail]).
pred_info(unify_rtn,           3, [regs,heap,trail]).
pred_info(unify_rtn_2,         3, [regs,heap,trail]).
pred_info(unify_rtn_2,         5, [regs,heap,trail]).
pred_info(unify_rtn_args_2,    6, [regs,heap,trail]).
pred_info(unify_rtn_args_3,    7, [regs,heap,trail]).
pred_info(unify_atm,           3, [regs,heap,trail]).
pred_info(unify_end,           2, [regs,heap,trail]).
pred_info(unify_varvar,        2, [regs,heap,trail]).
pred_info(get_size,            3, [     heap      ]).
pred_info(arith,               4, [regs,heap      ]).

pred_info(write_rtn,           0, [regs,heap,trail]).
pred_info(write_rtn,           1, [regs,heap,trail]).
pred_info(write_arg,           2, [regs,heap,trail]).
pred_info(write_args,          3, [regs,heap,trail]).


% Implement the accumulator commands:
table_command(ins(I,Val), In, In)  :- ins(In, I, Val).
table_command(get(I,Val), In, In)  :- get(In, I, Val).
table_command(set(I,Val), In, Out) :- set(In, I, Val, Out).

% Mask off tag before looking up heap entry:
heap_table_command(ins(_^I,Val), In, In)  :- ins(In, I, Val).
heap_table_command(get(_^I,Val), In, In)  :- get(In, I, Val).
heap_table_command(set(_^I,Val), In, Out) :- set(In, I, Val, Out).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

% *** Initialization and runtime options ***

:- dynamic(bamspec_option/1).

main :-
        save(bamspec, 1),
        prompt(_, ''),
        (   copyright,
            execute
        ;   error(['Sorry, the executable BAM specification has failed.'])
        ),
        halt.
main :-
        halt.

copyright :-
        write('Berkeley Abstract Machine (BAM) Executable Specification'), nl,
        write('Copyright (C) 1990 Peter Van Roy and '),
        write('Regents of the University of California'), nl, nl.

flag_print(I) :- bamspec_option(print), !, write('Executing '), write(I), nl.
flag_print(_).

% Look up symbolic label to get a numeric PC:
numeric_pc(PC, PC) -->> {integer(PC)}, !.
numeric_pc(PC, NPC) -->> [get(PC,NPC)]:lblmap.

% Read in the instructions and create the code array and label map:
% The code array gives the instruction corresponding to each PC value.
% The label map gives the PC value corresponding to each symbolic label.
read_code(Code, LblMap) :-
        read(Instr),
        read_code(Instr, 0, Code, LblMap).

read_code(end_of_file, _, Code, LblMap) :- !, seal(Code), seal(LblMap).
read_code((:-Option), PC, Code, LblMap) :- !,
        asserta(bamspec_option(Option)),
        read(NextInstr),
        read_code(NextInstr, PC, Code, LblMap).
read_code(Instr, PC, Code, LblMap) :-
        ins(Code, PC, Instr),
        insert_lblmap(Instr, LblMap, PC),
        PC1 is PC+1,
        read(NextInstr),
        read_code(NextInstr, PC1, Code, LblMap).

% Add an entry to the label map:
insert_lblmap(label(L), LblMap, PC) :- !, ins(LblMap, L, PC).
insert_lblmap(procedure(P), LblMap, PC) :- !, ins(LblMap, P, PC).
insert_lblmap(_, _, _).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

% *** Top level execution ***
execute :-
        write('Reading BAM code'), nl,
        read_code(Code, LblMap),
        write('Starting execution'), nl,
        execute(leaf, Regs, leaf, _, leaf, _, Code, _, LblMap, _, 0, N),
        write('Executed '), write(N), write(' instructions.'), nl,
        print_array(Regs).

execute(File) :-
        seeing(OldFile),
        see(File),
        read_code(Code, LblMap),
        seen,
        see(OldFile),
        execute(leaf, Regs, leaf, _, leaf, _, Code, _, LblMap, _, 0, N),
        write('Executed '), write(N), write(' instructions.'), nl,
        print_array(Regs).

execute -->>
        [set(r(e),leaf)]:regs,
        [set(r(b),leaf)]:regs,
        [set(r(h),tvar^0)]:regs,
        [set(r(tr),0)]:regs,
        [set(r(pc),0)]:regs,
        [set(r(cp),global_success/0)]:regs,
        instr(choice(1/2,[],global_failure/0)),
        instr_loop.

% Instruction execution loop:
instr_loop -->>
        [get(r(pc),PC)]:regs,
        !,
        instr_loop_end(PC).
instr_loop -->>
        error(['Attempt to execute beyond existing code.']).

instr_loop_end(write/1) -->> !, write_rtn, instr(return), instr_loop.
instr_loop_end(nl/0) -->> !, nl, instr(return), instr_loop.
instr_loop_end(global_success/0) -->> !,
        write('*** Global success ***'), nl.
instr_loop_end(global_failure/0) -->> !,
        write('*** Global failure ***'), nl.
instr_loop_end(fail) -->>
        instr(fail),
        !,
        instr_loop.
instr_loop_end(PC) -->>
        numeric_pc(PC, NPC),
        % Fetch:
        [get(NPC,Instr)]:code,
        NPC1 is NPC+1,
        [set(r(pc),NPC1)]:regs,
        % Execute:
        [1]:count,
        flag_print(Instr),
        instr(Instr),
        !,
        instr_loop.
instr_loop_end(PC) -->>
        error(['Program counter is ',PC]).
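
% Illustration only (not part of the specification): a minimal, self-contained
% sketch of the fetch-execute cycle above in plain Prolog, under simplifying
% assumptions. The code store is a plain list instead of the code array, the
% only instructions are the hypothetical nop, jump(PC) and halt, and every
% ex_ predicate name is invented for this sketch.
% ex_run(+Code, +PC, +N0, -N): N is N0 plus the number of instructions executed.
ex_run(Code, PC, N0, N) :-
        ex_fetch(PC, Code, Instr),      % Fetch the instruction at PC
        N1 is N0+1,                     % Count it, as [1]:count does above
        NextPC is PC+1,                 % Default successor, as for r(pc)
        ex_step(Instr, NextPC, Code, N1, N).

% ex_step(+Instr, +NextPC, +Code, +N0, -N): execute one instruction.
ex_step(halt, _, _, N, N).
ex_step(nop, PC, Code, N0, N) :- ex_run(Code, PC, N0, N).
ex_step(jump(L), _, Code, N0, N) :- ex_run(Code, L, N0, N).

% ex_fetch(+PC, +Code, -Instr): Instr is the element of Code at index PC,
% counting from 0. Fails if PC is beyond the end of the code.
ex_fetch(0, [Instr|_], Instr) :- !.
ex_fetch(PC, [_|Code], Instr) :- PC > 0, PC1 is PC-1, ex_fetch(PC1, Code, Instr).

% For example, ex_run([nop,jump(3),nop,halt], 0, 0, N) executes the
% instructions at PC 0, 1 and 3, skipping the nop at PC 2.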

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

% *** BAM Instructions ***

% 1. Unification support instructions:
instr(deref(V,W)) -->>
        var_i(get, V, X),
        deref_rtn(X, Y),
        var_i(set, W, Y).
instr(equal(EA,A,L)) -->>
        ea_e(EA, X),
        arg_i(A, Y),
        {lbl(L)},
        equal_rtn(X, Y, L).
instr(unify(V,W,F,G,L)) -->>
        var_i(get, V, X),
        var_i(get, W, Y),
        {nv_flag(F)},
        {nv_flag(G)},
        {lbl(L)},
        unify_rtn(X, Y, L).
instr(unify_atomic(V,I,L)) -->>
        var_i(get, V, X),
        {an_atomic(I)},
        {lbl(L)},
        unify_rtn(X, I, L).
instr(trail(V)) -->>
        var_i(get, V, X),
        trail_rtn(X).
instr(move(EA,VI)) -->>
        ea_m(EA, X),
        var_i(set, VI, X), !.
instr(push(EA,R,N)) -->>
        ea_p(EA, X),
        {hreg(R)},
        [get(R,Y)]:regs,
        [set(Y,X)]:heap,
        {pos(N)},
        add_word(Y, N, YN),
        [set(R,YN)]:regs.
instr(adda(R,S,T)) -->>
        {hreg(R)},
        [get(R,X)]:regs,
        numreg(S, Off),
        add_word(X, Off, NX),
        {hreg(T)},
        [set(T,NX)]:regs.
instr(pad(N)) -->>
        [get(r(h),H)]:regs,
        {pos(N)},
        add_word(H, N, NewH),
        [set(r(h),NewH)]:regs.


% 2. Conditional control flow instructions:
instr(choice(1/N,Rs,L)) -->> {pos(N), N>1, regs(Rs), lbl(L)}, !,
        save_choice_regs(Rs, NewB),
        {ins(NewB, r(retry), L)},
        [get(r(tr),TR)]:regs, {ins(NewB, r(tr), TR)},
        [get(r(e), E)]:regs,  {ins(NewB, r(e), E)},
        [get(r(cp),CP)]:regs, {ins(NewB, r(cp), CP)},
        [get(r(b), B)]:regs,  {ins(NewB, r(b), B)},
        [get(r(h), H)]:regs,  {ins(NewB, r(h), H)},
        {seal(NewB)},
        [set(r(hb), H)]:regs,
        [set(r(b),NewB)]:regs.
instr(choice(I/N,Rs,L)) -->> {pos(N), pos(I), 1<I, I<N, regs(Rs), lbl(L)}, !,
        [get(r(b),B)]:regs,
        restore_choice_regs(Rs, B),
        {set(B, r(retry), L, NewB)},
        [set(r(b),NewB)]:regs.
instr(choice(N/N,Rs,L)) -->> {pos(N), regs(Rs), lbl(L)}, !,
        [get(r(b),B)]:regs,
        restore_choice_regs(Rs, B),
        {get(B, r(b), NewB)},
        [set(r(b),NewB)]:regs,
        {get(NewB, r(h), H)},
        [set(r(hb),H)]:regs.
instr(fail) -->>
        [get(r(b),B)]:regs,
        {get(B,r(h),H)},
        [set(r(h),H)]:regs,
        {get(B,r(e),E)},
        [set(r(e),E)]:regs,
        {get(B,r(cp),CP)},
        [set(r(cp),CP)]:regs,
        [get(r(tr),CurTR)]:regs,
        {get(B,r(tr),OldTR)},
        detrail_rtn(CurTR, OldTR),
        {get(B,r(retry),L)},
        [set(r(pc),L)]:regs.
instr(switch(T,V,A,B,C)) -->>
        {a_tag(T)},
        var_i(get, V, X),
        extract_tag(X, TX),
        {lbl(A), lbl(B), lbl(C)},
        switch_rtn(T, TX, A, B, C).
instr(test(Eq,T,V,L)) -->>
        {a_tag(T)},
        var_i(get, V, X),
        extract_tag(X, TX),
        {eq_ne(Eq)},
        {lbl(L)},
        test_rtn(Eq, T, TX, L).
instr(jump(C,A,B,L)) -->>
        {cond(C)},
        numarg(A, XA), {extract_value(XA, VA), check_int(XA)},
        numarg(B, XB), {extract_value(XB, VB), check_int(XB)},
        {lbl(L)},
        jump_cond_rtn(C, VA, VB, L).
instr(move(r(b),V)) -->>
        [get(r(b),B)]:regs,
        a_var(set, V, B).
instr(cut(V)) -->>
        a_var(get, V, X),
        [set(r(b),X)]:regs,
        {get(X,r(h),H)},
        [set(r(hb),H)]:regs.
instr(hash(T,R,N,L)) -->> hash_type(T), pos(N), lbl(L), !,
        reg(get, R, X),
        hash_indirect(T, X, Y),
        [get(L,PC)]:lblmap,
        hash_lookup(PC, Y, N).
instr(pair(_,_)) -->>
        {error(['Attempt to execute inside a hash table.'])}.


% 3. Arithmetic instructions:
instr(add(A,B,V)) -->> arith(add, A, B, V).
instr(sub(A,B,V)) -->> arith(sub, A, B, V).
instr(mul(A,B,V)) -->> arith(mul, A, B, V).
instr(div(A,B,V)) -->> arith(div, A, B, V).
instr(mod(A,B,V)) -->> arith(mod, A, B, V).
instr(and(A,B,V)) -->> arith(and, A, B, V).
instr( or(A,B,V)) -->> arith( or, A, B, V).
instr(xor(A,B,V)) -->> arith(xor, A, B, V).
instr(not(A,V))   -->> arith(not, A, 0, V).
instr(sll(A,B,V)) -->> arith(sll, A, B, V).
instr(sra(A,B,V)) -->> arith(sra, A, B, V).

% 4. Procedural instructions:
instr(procedure(N/A)) -->> {atom(N), natural(A)}.
instr(call(N/A)) -->> {atom(N), natural(A)},
        [get(r(pc),PC)]:regs,
        [set(r(cp),PC)]:regs,
        [set(r(pc),N/A)]:regs.
instr(return) -->>
        [get(r(cp),PC)]:regs,
        [set(r(pc),PC)]:regs.
instr(simple_call(N/A)) -->> {atom(N), natural(A)},
        [get(r(pc),PC)]:regs,
        [set(r(tmp_cp),PC)]:regs,
        [set(r(pc),N/A)]:regs.
instr(simple_return) -->>
        [get(r(tmp_cp),PC)]:regs,
        [set(r(pc),PC)]:regs.
instr(label(L)) -->> {lbl(L)}.
instr(jump(L)) -->> {lbl(L)},
        [set(r(pc),L)]:regs.
instr(allocate(N)) -->>
        {natural(N)},
        [get(r(e),E)]:regs,
        {ins(NewE, r(e), E)},
        [get(r(cp),CP)]:regs,
        {ins(NewE, r(cp), CP)},
        {seal(NewE)},
        [set(r(e),NewE)]:regs.
instr(deallocate(N)) -->>
        {natural(N)},
        [get(r(e),E)]:regs,
        {get(E, r(e), NewE)},
        {get(E, r(cp), NewCP)},
        [set(r(e),NewE)]:regs,
        [set(r(cp),NewCP)]:regs.
instr(nop) -->> [].

% 5. Pragma information for translator and reorderer:
% Pragmas are no-ops in the execution.
instr(pragma(P)) -->> {pragma(P)}, !.

% 6. Additions to BAM for the assembly language programmer:
% The meaning of these instructions depends on the underlying architecture,
% so they are not included in this specification. See the Implementation
% Manual for a discussion of their use.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% *** Pragmas ***

% A variable is a multiple of N.
% Inserted just before loads in readmode unification.
pragma(align(V,N)) :- a_var(V), pos(N).

% Inserted just before a sequence of pushes in writemode unification.
% (The pushes may be interleaved with non-memory moves.)
pragma(push(term(Size))) :- pos(Size).
pragma(push(cons)).
pragma(push(structure(A))) :- pos(A).
pragma(push(variable)).

% Specify the tag of a variable.
% (This is useful for processors without explicit tag support.)
pragma(tag(V,T)) :- a_var(V), a_tag(T).

% Length of a hash table.
pragma(hash_length(Len)) :- pos(Len).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 


% *** Tags ***

a_tag(tatm).    /* atom */
a_tag(tint).    /* integer */
a_tag(tneg).    /* negative integer */
a_tag(tpos).    /* nonnegative integer */
a_tag(tstr).    /* structure */
a_tag(tlst).    /* cons cell */
a_tag(tvar).    /* variable */

atom_tag(tatm).

atomic_tag(tatm).
atomic_tag(tint).
atomic_tag(tneg).
atomic_tag(tpos).

pointer_tag(tstr).
pointer_tag(tlst).
pointer_tag(tvar).

% *** Addressing modes ***

% Both read and write access:
heap(get, W, X) -->> {ptr_word(W)}, [get(W,X)]:heap.
heap(set, W, X) -->> {ptr_word(W)}, [set(W,X)]:heap.

ptr_word(T^_) :- pointer_tag(T).

reg(get, R, X) -->> {reg(R)}, [get(R,X)]:regs.
reg(set, R, X) -->> {reg(R)}, [set(R,X)]:regs.

reg(r(I)) :- int(I), !.

hreg(R) :- reg(R), !.
hreg(r(h)).

perm(get, P, X) -->> {perm(P)}, [get(r(e),E)]:regs, {get(E,P,X)}.
perm(set, P, X) -->> {perm(P)}, [get(r(e),E)]:regs, {set(E,P,X,NewE)},
        [set(r(e),NewE)]:regs.

perm(p(I)) :- natural(I).

a_var(WR, V, X) -->> reg(WR, V, X), !.
a_var(WR, V, X) -->> perm(WR, V, X).

a_var(Reg) :- reg(Reg), !.
a_var(Perm) :- perm(Perm).

var_i(WR, [V], X) -->> a_var(get, V, W), heap(WR, W, X), !.
var_i(WR, V, X) -->> a_var(WR, V, X).

% Read access only:

% An int is its own value:
int(N) :- integer(N).

% An atomic is its own value:
an_atomic(I) :- int(I), !.
an_atomic(T^A) :- atom(A), atom_tag(T), !.
an_atomic(T^(F/N)) :- atom(F), pos(N), atom_tag(T).

arg(Arg, Arg) -->> {an_atomic(Arg)}, !.
arg(Arg, X) -->> a_var(get, Arg, X).

arg_i(Arg, Arg) -->> {an_atomic(Arg)}, !.
arg_i(Arg, X) -->> var_i(get, Arg, X).

numreg(Arg, Arg) -->> {int(Arg)}, !.
numreg(Arg, X) -->> reg(get, Arg, X).

numarg(Arg, Arg) -->> {int(Arg)}, !.
numarg(Arg, X) -->> var_i(get, Arg, X).

var_off([Var+I], X) -->> a_var(get, Var, T), !,
        {pos(I)}, add_word(T, I, T2), [get(T2,X)]:heap.
var_off([Var], X) -->> a_var(get, Var, T), [get(T,X)]:heap.

% Creating immediate tagged pointer objects:
imm_tag(Tag^(r(h)+D), W) -->> {pointer_tag(Tag)}, !,
        [get(r(h),T)]:regs,
        {pos(D)}, add_word(T, D, X),
        insert_tag(Tag, X, W).
imm_tag(Tag^r(h), W) -->> {pointer_tag(Tag)}, !,
        [get(r(h),X)]:regs,
        insert_tag(Tag, X, W).

% Effective address for equal:
ea_e(Var, X) -->> a_var(get, Var, X), !.
ea_e(VarOff, X) -->> var_off(VarOff, X).

% Effective address for move:
ea_m(Arg, X) -->> arg(Arg, X), !.
ea_m(VarOff, X) -->> var_off(VarOff, X), !.
ea_m(T^r(h), X) -->> imm_tag(T^r(h), X).

% Effective address for push:
ea_p(Arg, X) -->> arg_i(Arg, X), !.
ea_p(T^Y, X) -->> imm_tag(T^Y, X).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

% Miscellaneous

eq_ne(eq).      /* Equal */
eq_ne(ne).      /* Not equal */

cond(lts).      /* Signed less than */
cond(les).      /* Signed less than or equal */
cond(gts).      /* Signed greater than */
cond(ges).      /* Signed greater than or equal */
cond(eq).       /* Equal */
cond(ne).       /* Not equal */

hash_type(atomic).
hash_type(structure).

lbl(fail).
lbl(N/A) :- atom(N), natural(A).
lbl(l(N/A,I)) :- atom(N), natural(A), natural(I).

nv_flag(nonvar).
nv_flag(var).
nv_flag('?').

% A list of register numbers:
% (May contain the value 'no' as well)
regs([]).
regs([R|Set]) :- (int(R) ; R=no), regs(Set).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

% Dereference utilities:

deref_rtn(X, X) -->> {nonvartag(X)}, !.
deref_rtn(X, Y) -->>
        [get(X,X2)]:heap,
        deref_rtn_cont(X, X2, Y).

deref_rtn_cont(X, X, Y) -->> {Y=X}.
deref_rtn_cont(_, X, Y) -->> deref_rtn(X, Y).
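
% Illustration only (not part of the specification): a minimal, self-contained
% sketch of dereferencing in plain Prolog, assuming words are written Tag^Value
% and an unbound variable is a heap cell that points to itself. The heap is
% modeled as a list of Address-Word pairs rather than the heap accumulator,
% and the ex_ predicate names are hypothetical.
% ex_deref(+Heap, +Word, -Result): follow variable bindings to their end.
ex_deref(_, W, W) :- integer(W), !.             % An integer is its own value
ex_deref(_, T^V, T^V) :- T \== tvar, !.         % Nonvariable tag: stop here
ex_deref(Heap, tvar^A, W) :-
        ex_lookup(A, Heap, Next),
        (   Next == tvar^A
        ->  W = tvar^A                          % Self-reference: unbound
        ;   ex_deref(Heap, Next, W)             % Bound: keep following
        ).

% ex_lookup(+Addr, +Heap, -Word): Word is stored at address Addr in Heap.
ex_lookup(A, [A0-W0|Rest], W) :-
        (   A == A0 -> W = W0 ; ex_lookup(A, Rest, W) ).

% For example, with Heap = [0-(tvar^1), 1-(tvar^2), 2-(tatm^foo)] the goal
% ex_deref(Heap, tvar^0, W) follows the chain 0, 1, 2 and binds W to tatm^foo.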

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

% Equal routine:

equal_rtn(X, X, _) -->> !.
equal_rtn(_, _, L) -->> [set(r(pc),L)]:regs.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Switch and test routines:

switch_rtn(_, tvar, A,_,_) -->> !, [set(r(pc),A)]:regs.
switch_rtn(T, TX, _,B,_) -->> {equivalent_tag(T,TX)}, !, [set(r(pc),B)]:regs.
switch_rtn(_, _, _,_,C) -->> [set(r(pc),C)]:regs.

test_rtn(Eq, T, TX, L) -->> {test_true(Eq, T, TX)}, !, [set(r(pc),L)]:regs.
test_rtn(_, _, _, _) -->> [].

test_true(eq, T, TX) :- equivalent_tag(T, TX).
test_true(ne, T, TX) :- \+equivalent_tag(T, TX).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Arithmetic utilities:

arith(Op, A, B, V) -->>
        numarg(A, XA), {extract_value(XA, VA), check_int(XA)},
        numarg(B, XB), {extract_value(XB, VB), check_int(XB)},
        arith_operation(Op, VA, VB, VC),
        a_var(set, V, VC).

arith_operation(add, VA, VB, VC) :- VC is VA+VB.
arith_operation(sub, VA, VB, VC) :- VC is VA-VB.
arith_operation(mul, VA, VB, VC) :- VC is VA*VB.
arith_operation(div, VA, VB, VC) :- VC is VA//VB.
arith_operation(mod, VA, VB, VC) :- VC is VA mod VB.
arith_operation(and, VA, VB, VC) :- VC is VA /\ VB.
arith_operation( or, VA, VB, VC) :- VC is VA \/ VB.
arith_operation(xor, VA, VB, VC) :- VC is (VA /\ \(VB)) \/ (VB /\ \(VA)).
arith_operation(not, VA,  _, VC) :- VC is \(VA).
arith_operation(sll, VA, VB, VC) :- VC is VA<<VB.
arith_operation(sra, VA, VB, VC) :- VC is VA>>VB.

% Conditional jump:
jump_cond_rtn(C, VA, VB, L) -->> {jump_true(C, VA, VB)}, !, [set(r(pc),L)]:regs.
jump_cond_rtn(_, _, _, _) -->> [].

jump_true(lts, VA, VB) :- VA@<VB.
jump_true(gts, VA, VB) :- VA@>VB.
jump_true(les, VA, VB) :- VA@=<VB.
jump_true(ges, VA, VB) :- VA@>=VB.
jump_true( eq, VA, VB) :- VA==VB.
jump_true( ne, VA, VB) :- VA\==VB.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

% Hash table utilities:

hash_lookup(PC, X, N) -->>
        {PC1 is PC+2},
        [get(PC1,pragma(hash_length(N)))]:code,
        {PC2 is PC1+1},
        {PCN is PC1+N},
        hash_lookup_2(PC2, PCN, X).

hash_lookup_2(PC, PCN, _) -->> {PC>PCN}, !.
hash_lookup_2(PC, PCN, X) -->> {PC=<PCN},
        [get(PC,pair(E,L))]:code,
        {E==X},
        !,
        [set(r(pc),L)]:regs.
hash_lookup_2(PC, PCN, X) -->> {PC=<PCN},
        {PC1 is PC+1},
        hash_lookup_2(PC1, PCN, X).

% Indirection needed for structures because main functor is in memory:
hash_indirect(atomic, X, X) -->> [].
hash_indirect(structure, X, Y) -->> [get(X,Y)]:heap.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

% Choice point and fail utilities:

save_choice_regs([], _) -->> [].
save_choice_regs([no|Rs], B) -->> !,
        save_choice_regs(Rs, B).
save_choice_regs([I|Rs], B) -->>
        [get(r(I),R)]:regs,
        {ins(B, r(I), R)},
        save_choice_regs(Rs, B).

restore_choice_regs([], _) -->> [].
restore_choice_regs([no|Rs], B) -->> !,
        restore_choice_regs(Rs, B).
restore_choice_regs([I|Rs], B) -->>
        {get(B, r(I), R)},
        [set(r(I),R)]:regs,
        restore_choice_regs(Rs, B).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

% Trailing and detrailing:

trail_rtn(X) -->>
        [get(r(hb),HB)]:regs,
        cmp_trail(X, HB).

cmp_trail(X, HB) -->> {less_trail(X, HB)}, !,
        [get(r(tr),TR)]:regs,
        [set(TR,X)]:trail,
        {TR1 is TR+1},
        [set(r(tr),TR1)]:regs.
cmp_trail(_, _) -->> [].

less_trail(_^X, _^Y) :- X<Y.

% Restore to unbound the variables on the trail between OldTR and CurTR.
detrail_rtn(CurTR, OldTR) -->> {CurTR=<OldTR}, !.
detrail_rtn(CurTR, OldTR) -->> {CurTR>OldTR},
        {CurTR1 is CurTR-1},
        [get(CurTR1,V)]:trail,
        [set(V,V)]:heap,
        detrail_rtn(CurTR1, OldTR).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% General unification routine:

unify_rtn(W1, W2, L) -->>
        unify_rtn_2(W1, W2, Flag),
        unify_end(Flag, L).

unify_end(success, _) -->> [].
% For later: detrailing if L\=fail.
unify_end(fail, L) -->> [set(r(pc),L)]:regs.

unify_rtn_2(W1, W2, Flag) -->>
        {extract_tag_value(W1, T1, V1)},
        {extract_tag_value(W2, T2, V2)},
        unify_rtn_2(T1, V1, T2, V2, Flag).

unify_rtn_2(tvar, V1, NTag, V2, success) -->> {NTag\==tvar}, !,
        trail_rtn(tvar^V1),
        {make_word(NTag, V2, Word)},
        [set(tvar^V1,Word)]:heap.
unify_rtn_2(NTag, V2, tvar, V1, success) -->> {NTag\==tvar}, !,
        trail_rtn(tvar^V1),
        {make_word(NTag, V2, Word)},
        [set(tvar^V1,Word)]:heap.
unify_rtn_2(tvar, V1, tvar, V2, success) -->> !,
        unify_varvar(V1, V2).
% Matching atomic tags:
unify_rtn_2(ATag, V1, BTag, V2, Flag) -->>
        {atomic_tag(ATag)},
        {atomic_tag(BTag)},
        {equivalent_tag(ATag, BTag)},
        !,
        unify_atm(V1, V2, Flag).
% Non-matching nonvariable tags:
unify_rtn_2(ATag, _, BTag, _, fail) -->>
        {ATag\==tvar, BTag\==tvar},
        {\+equivalent_tag(ATag, BTag)},
        !.
% Matching pointer tags (recursive case):
unify_rtn_2(ATag, V1, ATag, V2, Flag) -->>
        {pointer_tag(ATag)},
        get_size(ATag, V1, Sz),
        unify_rtn_args_2(0, Sz, ATag, V1, V2, Flag).

% The term's Size is the maximum offset needed to traverse the term in memory.
get_size(tlst, _, 1) -->> [].
get_size(tstr, V, N) -->>
        [get(tstr^V,Func)]:heap,
        {Func=(tatm^(_/N))}.

unify_rtn_args_2(N, Sz, _, _, _, success) -->> {N>Sz}, !.
unify_rtn_args_2(N, Sz, T, V, W, Flag) -->> {N=<Sz}, !,
        {VN is V+N},
        {WN is W+N},
        [get(T^VN,VX)]:heap, deref_rtn(VX, DVX),
        [get(T^WN,WX)]:heap, deref_rtn(WX, DWX),
        unify_rtn_2(DVX, DWX, F),
        {N1 is N+1},
        unify_rtn_args_3(F, N1, Sz, T, V, W, Flag).

% Continue with other arguments if argument unification succeeded:
unify_rtn_args_3(fail, _, _, _, _, _, fail) -->> [].
unify_rtn_args_3(success, N1, Sz, T, V, W, Flag) -->>
        unify_rtn_args_2(N1, Sz, T, V, W, Flag).

% Unifying value parts of two atomic terms with equivalent tag:
unify_atm(V, V, success) -->> !.
unify_atm(_, _, fail) -->> [].

% Unifying two variables: bind youngest to oldest, trail youngest.
unify_varvar(V1, V2) -->> {V1>V2}, !,
        trail_rtn(tvar^V1),
        [set(tvar^V1,tvar^V2)]:heap.
unify_varvar(V1, V2) -->> {V1=<V2}, !,
        trail_rtn(tvar^V2),
        [set(tvar^V2,tvar^V1)]:heap.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

% Simple type utilities:

ground(X) :- nonvar(X), functor(X, _, N), ground(N, X).

ground(N, _) :- N=:=0, !.
ground(N, X) :- N=\=0, arg(N, X, A), ground(A), N1 is N-1, ground(N1, X).

natural(N) :- integer(N), N>=0.
pos(N) :- integer(N), N>0.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Word, tag, and value manipulation utilities:

% This takes into account the relationship between tpos, tneg and tint.
% For integers it extracts tpos or tneg tags and the absolute value
% of the integer. It creates the correct integer, given the tpos, tneg
% or tint tags.

equivalent_tag(T, T) :- !.
equivalent_tag(tint, tpos) :- !.
equivalent_tag(tint, tneg).

extract_tag(N, tpos) :- integer(N), N>=0, !.
extract_tag(N, tneg) :- integer(N), N<0, !.
extract_tag(T^_, T).

extract_value(N, N) :- int(N), N>=0, !.
extract_value(N, M) :- int(N), N<0, !, M is -N.
extract_value(_^V, V).

extract_tag_value(W, T, V) :-
        extract_tag(W, T),
        extract_value(W, V).

nonvartag(I) :- int(I), !.
nonvartag(T^_) :- \+T=tvar.

% Only used for pointer tags:
insert_tag(T, _^V, T^V).

make_word(tint, I, I) :- !.
make_word(tpos, I, I) :- !.
make_word(tneg, N, I) :- !, I is -N.
make_word(T, V, T^V).

add_word(T^I, J, T^K) :- K is I+J.

% Eventually, print out value of PC:
check_int(I) :- int(I), !.
check_int(_) :-
        error(['Operand of conditional is not an integer.']).
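
% Illustration only (not part of the specification): a minimal, self-contained
% sketch of how an integer word splits into a sign tag plus absolute value and
% is rebuilt from them, assuming words are written Tag^Value. The ex_ predicate
% names are hypothetical.
% ex_split(+Word, -Tag, -Value): decompose a word.
ex_split(N, tpos, N) :- integer(N), N >= 0, !.
ex_split(N, tneg, M) :- integer(N), N < 0, !, M is -N.
ex_split(T^V, T, V).

% ex_join(+Tag, +Value, -Word): rebuild a word from a tag and a value.
ex_join(tpos, V, V) :- !.
ex_join(tneg, V, N) :- !, N is -V.
ex_join(T, V, T^V).

% For example, ex_split(-5, T, V) gives T = tneg and V = 5, and
% ex_join(tneg, 5, W) gives back W = -5.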
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 
% Table utilities:

% This code implements a mutable array, represented as a binary tree.

% Insert a value in logarithmic time and constant space:
% This predicate is used in this program only to create the array,
% although it can also be used to access array elements.
ins(T, I, V) :- hash(I, H), ins_2(T, H, V).

ins_2(node(N,W,L,R), I, V) :- ins_2(N, W, L, R, I, V).

ins_2(N, V, _, _, I, V) :- I=N, !.
ins_2(N, _, L, R, I, V) :-
        compare(Order, I, N),
        ins_2(Order, I, V, L, R).

ins_2(<, I, V, L, _) :- ins_2(L, I, V).
ins_2(>, I, V, _, R) :- ins_2(R, I, V).

% Access a value in logarithmic time and constant space:
% This predicate cannot be used to create the array incrementally,
% but it is faster than ins/3.
get(T, I, V) :- hash(I, H), get_2(T, H, V).

get_2(node(N,W,L,R), I, V) :-
        compare(Order, I, N),
        get_3(Order, I, V, W, L, R).

get_3(<, I, V, _, L, _) :- get_2(L, I, V).
get_3(=, _, V, W, _, _) :- V=W.
get_3(>, I, V, _, _, R) :- get_2(R, I, V).

% Update an array in logarithmic time and space:
set(T, I, V, U) :- hash(I, H), set_2(T, H, V, U).

set_2(leaf, I, V, node(I,V,leaf,leaf)).
set_2(node(N,W,L,R), I, V, node(N,NW,NL,NR)) :-
        compare(Order, I, N),
        set_3(Order, I, V, W, L, R, NW, NL, NR).

set_3(<, I, V, W, L, R, W, NL, R) :- set_2(L, I, V, NL).
set_3(=, _, V, _, L, R, V, L, R).
set_3(>, I, V, W, L, R, W, L, NR) :- set_2(R, I, V, NR).
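
% Illustration only (not part of the specification): a minimal, self-contained
% sketch of the update-by-copying idea used by set/4 above, with the hashing
% step left out so that keys are compared directly. An update returns a new
% tree and leaves the old one unchanged, which is what lets choice points keep
% older register sets. The ex_ predicate names are hypothetical.
% ex_set(+Tree, +Key, +Val, -NewTree): NewTree is Tree with Key mapped to Val.
ex_set(leaf, K, V, node(K,V,leaf,leaf)).
ex_set(node(N,W,L,R), K, V, New) :-
        compare(Order, K, N),
        ex_set(Order, K, V, N, W, L, R, New).

ex_set(<, K, V, N, W, L, R, node(N,W,NL,R)) :- ex_set(L, K, V, NL).
ex_set(=, _, V, N, _, L, R, node(N,V,L,R)).
ex_set(>, K, V, N, W, L, R, node(N,W,L,NR)) :- ex_set(R, K, V, NR).

% ex_get(+Tree, +Key, -Val): look up Key; fails if Key is absent.
ex_get(node(N,W,L,R), K, V) :-
        compare(Order, K, N),
        ex_get(Order, K, V, W, L, R).

ex_get(<, K, V, _, L, _) :- ex_get(L, K, V).
ex_get(=, _, V, V, _, _).
ex_get(>, K, V, _, _, R) :- ex_get(R, K, V).

% For example, after ex_set(leaf, 2, a, T1), ex_set(T1, 1, b, T2) and
% ex_set(T2, 2, c, T3), key 2 maps to c in T3 but still maps to a in T2.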



%  Prevent  any  further  insertions  in  the  array: 
seal(leaf).
seal(node(_,_,L,R)) :- seal(L), seal(R).

% Print values of array in sorted order:
print_array(Term) :-
        flat_array(Term, 2, Flat),
        print_list(Flat).

print_list([]).
print_list([(A->B)|L]) :-
        write(A), put(9), write('= '), write(B), nl,
        print_list(L).

flat_array(Term, N, Sort) :-
        N>0, N1 is N-1,
        flat_array(Term, N1, Flat, []), !,
        sort(Flat, Sort).

£lat_array (leaf ,  N,  11)  N=:“0, 

flat_array (nodfe ,  N,  N-:“0, 

f lat_array  (Term,  Term)  . 

flat_array(leaf, _) --> [].
flat_array(node(H,T,L,R), N) -->
        flat_array(L, N),
        {hash(H, I)},
        {flat_array(T, N, F)},
        [(I->F)],
        flat_array(R, N).

% Invertible hash function:

% Bit inversion of the integer components of a ground term. Other parts are
% unchanged. This one inverts the low 16 bits. It can be changed by changing
% the last argument of bit_invert/3.
hash(I, H) :- integer(I), !, bit_invert(I, H, 16).
hash(T, H) :- functor(T, Na, Ar), functor(H, Na, Ar), hash_2(Ar, T, H).

hash_2(0, _, _) :- !.
hash_2(N, T, H) :- N>0,
        arg(N, T, X),
        arg(N, H, Y),
        hash(X, Y),
        N1 is N-1,
        hash_2(N1, T, H).

bit_invert(0, 0, _) :- !.
bit_invert(N, I, B) :- N>0,
        L is N>>1,
        R is N/\1,
        B1 is B-1,
        bit_invert(L, I1, B1),
        I is R*(1<<B) + I1.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Error handling:
error(L) :-
        write('*** Error: '),
        error_loop(L),
        write('***'), nl.

error_loop([]).
error_loop([M|L]) :- write(M), error_loop(L).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

% Primitive version of write:

write_rtn -->>
        [get(r(0),X)]:regs,
        write_rtn(X).

write_rtn(tvar^V) -->> !, {write('_'), write(V)}.
write_rtn(I) -->> {int(I)}, !, {write(I)}.
write_rtn(tatm^(F/N)) -->> !, {write('"'), write(F/N), write('"')}.
write_rtn(tatm^A) -->> {write(A)}.
write_rtn(tlst^V) -->> !,
        {W is V+1},
        [get(tlst^V,Head)]:heap,
        [get(tlst^W,Tail)]:heap,
        deref_rtn(Head, DHead),
        deref_rtn(Tail, DTail),
        {write('[')},
        write_rtn(DHead),
        {write('|')},
        write_rtn(DTail),
        {write(']')}.
write_rtn(tstr^V) -->> !,
        [get(tstr^V,tatm^(F/N))]:heap,
        {write(F), write('(')},
        write_arg(V, 1),
        write_args(2, N, V),
        {write(')')}.

write_args(I, N, _) -->> {I>N}, !.
write_args(I, N, V) -->> {I=<N}, !,
        {I1 is I+1},
        {write(',')},
        write_arg(V, I),
        write_args(I1, N, V).

write_arg(V, I) -->>
        {W is V+I},
        [get(tstr^W, X)]:heap,
        deref_rtn(X, DX),
        write_rtn(DX).


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 
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Appendix  D 

Semantics  of  the  Berkeley  Abstract  Machine 


1.  Introduction 

This appendix gives an English-language description of the semantics of the Berkeley Abstract
Machine (BAM) as comments attached to a Prolog specification of its syntax. The BAM is intended to
operate on the same data structures as the Warren Abstract Machine (WAM); therefore some familiarity
with the WAM is an advantage. The semantics are represented by short descriptions supplemented by
pseudo-code and examples where necessary.

The BAM is designed to be simple and easily translated to most general-purpose processors. Many
of its optimizations apply to any processor, for example the streamlined choice point management and the
use of write-once permanent variables to simplify trailing. Although the first target is the VLSI-BAM
processor, we have built translators for other processors including the MIPS and the MC68020. Pragmas give
information that is used to obtain the best translation for different processors.

The instruction set is divided into six categories, each in a different section. Each section starts with a
box giving the syntax of the instructions presented in that section. This is followed by a description of the
instructions' actions. Section 2 gives the unification instructions. Section 3 gives the conditional control
flow instructions. Section 4 gives the arithmetic instructions. Section 5 gives the procedural control flow
instructions. Section 6 gives the pragmas, which contain information that allows better translation. Section
7 gives the user instructions, additions to the BAM that are never output by the compiler but are intended
for the BAM assembly programmer. The last section defines the syntax and semantics of the addressing
modes used in the instructions.

In explaining the semantics, a few assumptions are made about the data representation. An infinite
number of registers is assumed; the translator should map registers of sufficiently large index to memory.
A tagged architecture is assumed; i.e. each word contains a tag and a value field which are treated as
separate entities in some instructions and as a unit in other instructions. A load-store architecture is
assumed; almost any architecture has a subset of instructions that satisfy this assumption. The actual
details of the translation to the target architecture are not given since they depend on the characteristics of
the architecture. These characteristics include the number of registers, the addressing modes, hardware
support for certain features (tagging, dereferencing, trailing, etc.), the precise format of choice points and
environments, and so forth.


2.  Unification  instructions 


Unification syntax

instr(deref(V,W))           :- var_i(V), var_i(W).
instr(equal(EA,A,L))        :- ea_e(EA), arg_i(A), lbl(L).
instr(unify(V,W,F,G,L))     :- var_i(V), var_i(W), nv_flag(F), nv_flag(G), lbl(L).
instr(trail(V))             :- var_i(V).
instr(move(EA,VI))          :- ea_m(EA), var_i(VI).
instr(push(EA,R,N))         :- ea_p(EA), hreg(R), pos(N).
instr(adda(R,S,T))          :- numreg(R), numreg(S), hreg(T).
instr(pad(N))               :- pos(N).
instr(unify_atomic(V,I,L))  :- var_i(V), an_atomic(I), lbl(L).
instr(fail).

deref(V,W)              Dereference the argument V and store the result in W. The argument
                        V is unchanged. This is the only instruction which dereferences its
                        argument. All other instructions assume that their arguments are
                        dereferenced. Giving the dereference instruction two arguments
                        simplifies the implementation of write-once permanent variables and
                        makes a fast implementation of trailing possible.

equal(X,Y,L)            Compare X to Y and branch to L if they are not equal. The comparison
                        is a full word operation, equivalent to "eq" in Lisp. It is assumed that
                        X and Y are dereferenced.

unify(X,Y,T,U,L)        Perform a general unification of X and Y, and branch to L if it fails.
                        Always binds oldest variables to the youngest. In the failure case all
                        bindings are undone. It is assumed that X and Y are dereferenced. The
                        two parameters T and U are added as an optimization, and may be
                        safely ignored. They are flags (with values '?', var, or nonvar)
                        that say whether it is known if X and Y are variables or nonvariables.
                        With this information a better translation to the target processor can be
                        done.

trail(X)                Push the address of X on the trail stack if the trail condition X<r(hb)
                        is satisfied. It is assumed that X is a dereferenced unbound variable,
                        i.e. it has a tvar tag. Only one comparison is necessary for the trail
                        check. The state register r(hb) points to the heap location which
                        was the top of the heap when the most recent choice point was created.

move(X,Y)               Move X to Y. Depending on the addressing mode, this instruction does
                        a load or store or creates a tagged value.

push(X,R,N)             Push X on the stack with stack pointer R, then increment R by N. This
                        instruction is used for write mode unification.

adda(X,Y,R)             Add X and Y into R. This is a full word operation which never traps,
                        unlike the arithmetic instructions in section 4. This instruction is used
                        to allocate space for uninitialized variables. The second argument Y is
                        an offset which is scaled properly by the translator (i.e. it is unchanged
                        for the VLSI-BAM since it is word-addressed, and it is multiplied by 4
                        for the MIPS, since it is byte-addressed).

pad(N)                  Add N words to the heap pointer r(h). This is a full word operation
                        which never traps, unlike the arithmetic instructions in section 4. It is
                        used to ensure the correct alignment of compound terms. The space
                        reserved by pad will never be stored to. If the increment is a multiple
                        of the alignment then the pad disappears. The increment is scaled
                        properly by the translator (see previous description of adda).

unify_atomic(X,Y,L)     Unify the variable X with the atomic term Y, and branch to L if it fails.
                        It is assumed that X is dereferenced. The unify_atomic instruction
                        is a special case of general unification that is added to reduce code
                        size in the VLSI-BAM processor. There is a compiler option to enable
                        or disable the generation of this instruction.

fail                    Untrail all variable bindings and jump to the retry address. Do not
                        restore argument registers. Argument registers are restored by the
                        choice point management instructions.
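
The trail check described for trail(X) can be sketched in Prolog. This is only an illustrative
model, not the BAM implementation: addresses are modelled as plain integers, the trail stack as a
list, and the predicate name trail_if_needed/4 and its argument order are assumptions made for
this example.

```prolog
% trail_if_needed(+X, +HB, +Trail0, -Trail)
%   X      : address of a dereferenced unbound variable (modelled as an integer)
%   HB     : value of the state register r(hb)
%   Trail0 : trail stack before the instruction, newest entry first
%   Trail  : trail stack after the instruction
trail_if_needed(X, HB, Trail0, Trail) :-
    (   X < HB                  % the single comparison X < r(hb)
    ->  Trail = [X|Trail0]      % condition holds: push the address of X
    ;   Trail = Trail0          % condition fails: leave the trail unchanged
    ).
```

For example, with r(hb) modelled as 10, the goal trail_if_needed(3, 10, [], T) binds T to [3],
while trail_if_needed(12, 10, [7], T) leaves the trail as [7].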


3.  Conditional  control  flow  instructions 


Clause selection syntax

instr(switch(T,V,A,B,C))  :- a_tag(T), var_i(V), lbl(A), lbl(B), lbl(C).
instr(choice(I/N,Rs,L))   :- pos(I), pos(N), I=<N, lbl(L), regs(Rs).
instr(test(Eq,T,V,L))     :- eq_ne(Eq), var_i(V), a_tag(T), lbl(L).
instr(jump(C,A,B,L))      :- cond(C), numarg_i(A), numarg_i(B), lbl(L).
instr(move(CH,V))         :- a_var(V), choice_ptr(CH).
instr(cut(V))             :- a_var(V).
instr(hash(T,R,N,L))      :- hash_type(T), reg(R), pos(N), lbl(L).
instr(pair(E,L))          :- an_atomic(E), lbl(L).

switch(T,R,A,B,C)       A three-way branch: branch to the label A, B, C depending on whether
                        the tag of R is tvar, T, or any other value. The label fail is not
                        an address, but denotes a branch to the global failure routine. It is
                        assumed that R is dereferenced.

choice(I/N,RS,L)        The choice point management instruction for choosing clause I out of
                        N clauses. Choice points are of variable size. The semantics of choice
                        depends on I as follows:

                        I=1      Create a choice point with retry address L. Save in it the
                                 registers listed in RS.

                        1<I<N    Restore the registers mentioned in RS from the choice point,
                                 ignoring no terms. The no terms make it possible to
                                 know the position of the registers in the choice point without
                                 an explicit size field in the choice point. Update the retry
                                 address to L.

                        I=N      Restore the registers mentioned in RS, ignoring no terms.
                                 Remove the choice point. (L will always be fail when
                                 I=N.)

                        The above notation is consistent with three possible implementations
                        (in order of decreasing efficiency):

                        (1) The implementation given above, in which only those registers
                        listed in RS are saved and restored, and the choice point does not have
                        a size field. Restoring registers is done by the choice instructions, not
                        by the fail instruction. The compiler makes an effort to minimize the
                        set of registers mentioned in RS.

                        (2) Saving all registers up to the maximum register listed in RS. In this
                        case the choice points are of variable size, and the no terms in RS are
                        ignored. The notation is consistent with choice points containing a size
                        field.

                        (3) Always saving and restoring all registers. In this case the
                        choice points are of fixed size, the RS argument is ignored, and the fail
                        instruction restores the registers. In this case the semantics correspond
                        to the try, retry, and trust instructions of the WAM.
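
The save and restore behaviour of implementation (1) can be modelled with a few lines of Prolog.
This is a sketch under stated assumptions, not the BAM run-time code: the register file is
modelled as a list of Number-Value pairs, the choice point as a list of saved values, and the
predicate names save_regs/3, restore_regs/4 and set_reg/4 are hypothetical. It also assumes one
reading of the text, namely that each position of RS corresponds to one slot of the choice point
and that a no entry means "skip this slot".

```prolog
% save_regs(+Rs, +Regs, -Slots)
%   Slots holds the value of each register named in Rs, in the same order.
save_regs([], _, []).
save_regs([R|Rs], Regs, [V|Slots]) :-
    memberchk(R-V, Regs),
    save_regs(Rs, Regs, Slots).

% restore_regs(+Rs, +Slots, +Regs0, -Regs)
%   Walk Rs and Slots in step; a 'no' entry skips its slot.
restore_regs([], _, Regs, Regs).
restore_regs([no|Rs], [_|Slots], Regs0, Regs) :- !,
    restore_regs(Rs, Slots, Regs0, Regs).
restore_regs([R|Rs], [V|Slots], Regs0, Regs) :-
    set_reg(R, V, Regs0, Regs1),
    restore_regs(Rs, Slots, Regs1, Regs).

% set_reg(+R, +V, +Regs0, -Regs): replace the value of register R by V.
set_reg(R, V, [R-_|Regs], [R-V|Regs]) :- !.
set_reg(R, V, [P|Regs0], [P|Regs]) :-
    set_reg(R, V, Regs0, Regs).
```

In this model, saving registers [0,1,2] from [0-a,1-b,2-c] gives the slots [a,b,c]. Restoring
with RS = [0,no,2] into [0-x,1-y,2-z] gives [0-a,1-y,2-c]: the middle slot is skipped, but its
presence in RS is what lets the third slot be found without a size field.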

test(E,T,X,L)           Branch to label L if the tag of X is equal/not equal to T.
                        Equality/nonequality is selected by the value of E. The label fail is
                        not an address, but denotes a branch to the global failure routine. It is
                        assumed that X is dereferenced.

jump(C,X,Y,L)           Compare X and Y and jump to L if the comparison is true. The kind of
                        comparison is given by C. This instruction traps if either argument is
                        not an integer. The label fail is not an address, but denotes a
                        branch to the global failure routine.

cut(X)                  Implement the cut operation. Move X into the r(b) register; also
                        move the value of r(h) in this choice point into the r(hb) register.
                        The latter move is an optimization that reduces the number of
                        trailed variables, but is not needed for correctness. The compiler
                        ensures that X contains a pointer to the choice point which was most
                        recent when the current predicate was entered.

hash(T,R,N,L)           Look up register R in a hash table located at label L. The hash table
                        contains atomic terms (when T=atomic) or the main functors of
                        structures (when T=structure). If R is not in the hash table, then
                        execution falls through to the next instruction. Otherwise execution
                        continues at the label contained in the hash table. When
                        T=structure the compiler guarantees that R points to a structure.

                        The following is an example of hash table code:

                        hash(Type,Reg,N,Lbl).   ; Hash Reg into table at Lbl
                        ...                     ; Fall through if not present

                        label(Lbl).             ; The hash table
                        hash_length(N).         ; Length of the hash table
                        pair(E1,L1).            ; N entries
                        pair(E2,L2).
                        ...
                        pair(Ei,Li).            ; Jump to Li if Reg = Ei
                        ...
                        pair(EN,LN).

pair(E,L)               A hash table entry. E is either an atom or the main functor of a
                        structure. The label L is the address where execution continues if
                        the supplied value matches E.


4. Arithmetic instructions


Arithmetic syntax

instr(add(A,B,V))  :- numarg_i(A), numarg_i(B), a_var(V).
instr(sub(A,B,V))  :- numarg_i(A), numarg_i(B), a_var(V).
instr(mul(A,B,V))  :- numarg_i(A), numarg_i(B), a_var(V).
instr(div(A,B,V))  :- numarg_i(A), numarg_i(B), a_var(V).
instr(and(A,B,V))  :- numarg_i(A), numarg_i(B), a_var(V).
instr( or(A,B,V))  :- numarg_i(A), numarg_i(B), a_var(V).
instr(xor(A,B,V))  :- numarg_i(A), numarg_i(B), a_var(V).
instr(not(A,V))    :- numarg_i(A), a_var(V).
instr(sll(A,B,V))  :- numarg_i(A), numarg_i(B), a_var(V).
instr(sra(A,B,V))  :- numarg_i(A), numarg_i(B), a_var(V).

All arithmetic instructions assume that their operands are dereferenced and destructively overwrite
the result register. All perform operations on integers with correct tag and return a result with correct tag,
trapping if either operand or the result is not an integer. Arithmetic semantics are:


add(X,Y,Z)      Z <- X + Y
sub(X,Y,Z)      Z <- X - Y
mul(X,Y,Z)      Z <- X * Y
div(X,Y,Z)      Z <- X / Y
and(X,Y,Z)      Z <- X and Y    (bitwise and)
or(X,Y,Z)       Z <- X or Y     (bitwise or)
xor(X,Y,Z)      Z <- X xor Y    (bitwise exclusive or)
sll(X,Y,Z)      Z <- X << Y     (logical shift of X left Y places)
sra(X,Y,Z)      Z <- X >> Y     (arithmetic shift of X right Y places)
not(X,Z)        Z <- not X      (bitwise invert X into Z)
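
The trapping behaviour can be illustrated with a small Prolog model of some of these
instructions. This is a sketch only: the predicate names bam_arith/1, must_be_int/1 and eval/4,
and the use of a thrown trap(...) term to stand for a trap, are assumptions of the example.
Because Prolog integers here are not tagged machine words, the case where the result is not a
valid integer is not modelled.

```prolog
% bam_arith(+Instr): execute one arithmetic instruction, binding its last
% argument to the result. Operands must already be integers.
bam_arith(not(X, Z)) :- !,
    must_be_int(X),
    Z is \(X).                          % bitwise invert X into Z
bam_arith(Instr) :-
    Instr =.. [Op, X, Y, Z],
    must_be_int(X),
    must_be_int(Y),
    eval(Op, X, Y, Z).

% must_be_int(+X): model a trap as an exception when X is not an integer.
must_be_int(X) :-
    (   integer(X)
    ->  true
    ;   throw(trap(not_an_integer(X)))
    ).

eval(add, X, Y, Z) :- Z is X + Y.
eval(sub, X, Y, Z) :- Z is X - Y.
eval(mul, X, Y, Z) :- Z is X * Y.
eval(and, X, Y, Z) :- Z is X /\ Y.      % bitwise and
eval(or,  X, Y, Z) :- Z is X \/ Y.      % bitwise or
eval(sll, X, Y, Z) :- Z is X << Y.      % shift left
eval(sra, X, Y, Z) :- Z is X >> Y.      % shift right
```

For example, bam_arith(add(2, 3, Z)) binds Z to 5, and bam_arith(add(a, 1, Z)) throws
trap(not_an_integer(a)) instead of producing a result.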

5.  Procedural  control  flow  instructions 


Procedural syntax

instr(procedure(N/A))     :- atom(N), natural(A).
instr(call(N/A))          :- atom(N), natural(A).
instr(return).
instr(simple_call(N/A))   :- atom(N), natural(A).
instr(simple_return).
instr(label(L))           :- lbl(L).
instr(jump(L))            :- lbl(L).
instr(allocate(Perms))    :- natural(Perms).
instr(deallocate(Perms))  :- natural(Perms).

procedure(P)            The entry point of procedure P.

call(N/A)               Call the procedure N/A, assuming a fixed location for the arguments.
                        The arguments of N/A are sequentially loaded into argument registers.
                        By default the registers used are numbered from zero, i.e. r(0),
                        r(1), ... This call is used for all user-defined predicates. It may be
                        nested, but must be surrounded by an allocate-deallocate pair when
                        used in the body of a predicate.

return                  Return from a call.

simple_call(N/A)        Simple call of the procedure N/A, assuming the same argument passing
                        as call(N/A). This is a one-level call; it may not be nested. It
                        does not require a surrounding allocate-deallocate pair. It can be
                        implemented by saving the return address in a fixed register. This
                        instruction is useful for interfacing with assembly routines.

simple_return           Return from a simple call.

label(L)                Denotes a branch destination. The label fail is not an address, but
                        denotes a branch to the global failure routine.

jump(L)                 Jump unconditionally to label L. The label may refer to the first
                        instruction of another procedure N/A or it may be internal to the
                        current procedure. The label fail is not an address, but denotes a
                        branch to the global failure routine.

allocate(N)             Create an environment of size N on the local stack, i.e. a new set of N
                        permanent variables which are denoted by p(I). Typically, the only
                        state registers stored in the environment are r(e) and r(cp). The
                        environment must NOT contain the r(b) register.

deallocate(N)           Remove the top-most environment (which is of size N) from the local
                        stack.


6.  Pragmas 


Pragma syntax

instr(pragma(Pragma)) :- pragma(Pragma).

pragma(align(V,N))          :- a_var(V), pos(N).
pragma(push(term(Size)))    :- pos(Size).
pragma(push(cons)).
pragma(push(structure(A)))  :- pos(A).
pragma(push(variable)).
pragma(tag(V,T))            :- a_var(V), a_tag(T).
pragma(hash_length(Len))    :- pos(Len).

align(V,N)              At this point the contents of register or permanent V are a multiple of
                        N. This information helps the reordering stage to generate double-word
                        load instructions for the VLSI-BAM processor.

push(term(S))           At this point a block of push instructions is about to create a term of
                        size S on the heap.

push(cons)              At this point a cons cell (of size two words) is about to be created on
                        the heap. This information helps the reordering stage to generate
                        double-word push instructions for the VLSI-BAM processor.

push(structure(A))      At this point a structure of arity A is about to be created on the heap.
                        This information helps the reordering stage to generate double-word
                        push instructions for the BAM processor.

push(variable)          At this point an unbound, initialized variable is about to be created on
                        the heap.

tag(V,T)                The contents of variable V have tag T. This pragma precedes a load or
                        a store with address V. It is used to make loads and stores efficient for
                        processors which do not have explicit tag support.

hash_length(N)          N is the length of the hash table starting at this point.

7.  User  instructions 

This section describes the parts of the BAM language that are never output by the compiler, but only
used by the BAM assembly programmer. They are used to write the run-time system in BAM code, so that it
is as portable as possible. Additional instructions are jump to register address, creating and decomposing
tagged words, non-trapping full-word arithmetic, non-trapping full-word unsigned comparison, and trailing
for backtrackable destructive assignment. Additional registers are used in implementing the run-time
system, and can be mapped to memory locations.


jump_reg(R)             Jump unconditionally to the address stored in register R.

jump_nt(C,A,B,L)        Compare A and B and jump to L if the comparison is true. The kind of
                        comparison is given by C. This instruction does a full word
                        comparison and never traps. The label fail is not an address, but
                        denotes a branch to the global failure routine.

ord(A,B)                Store in B the machine integer that corresponds to the atom or integer
                        in A. This function strips the tag from A, and therefore depends on the
                        target machine and the program that is compiled. It is used to convert
                        atoms and integers into table indices.

val(T,A,B)              Create a tagged word in B by combining the tag T and the machine
                        integer in A. This function is the inverse of ord(A,B): in the
                        sequence ord(A1,B), val(T,B,A2) the argument A2 will
                        receive an identical value to A1 if T is the tag of A1.

add_nt(A,B,V)           These arithmetic instructions destructively overwrite the result register.
sub_nt(A,B,V)           All perform operations on full words, return a full word, and never
and_nt(A,B,V)           trap. See the previous section on arithmetic for a description of the
or_nt(A,B,V)            operations performed.
xor_nt(A,B,V)
not_nt(A,B,V)
sll_nt(A,B,V)
sra_nt(A,B,V)

trail_bda(X)            Push the address and value of X on the trail stack if the trail condition
                        X<r(hb) is satisfied. It is assumed that X is dereferenced. When
                        detrailing, the old value of X is restored. This is used to implement
                        backtrackable destructive assignment. Only one comparison is
                        necessary for the trail check. The state register r(hb) points to the heap
                        location which was the top of the heap when the most recent choice
                        point was created.


8. Instruction arguments

This section defines the syntax of the instructions' arguments.

Addressing modes for equal, move and push

% Effective address for equal:
ea_e(Var)         :- a_var(Var).
ea_e(VarOff)      :- var_off(VarOff).

% Effective address for move:
ea_m(Arg)         :- arg(Arg).
ea_m(VarOff)      :- var_off(VarOff).
ea_m(Tag^H)       :- pointer_tag(Tag), heap_ptr(H).

% Effective address for push:
ea_p(Arg)         :- arg_i(Arg).
ea_p(Tag^H)       :- pointer_tag(Tag), heap_ptr(H).
ea_p(Tag^(H+D))   :- pointer_tag(Tag), pos(D), heap_ptr(H).

Other addressing modes

heap_ptr(r(h)).
choice_ptr(r(b)).

reg(r(I))           :- int(I).
reg(T)              :- user_reg(T).

hreg(R)             :- reg(R).
hreg(R)             :- heap_ptr(R).

perm(p(I))          :- natural(I).

an_atomic(I)        :- int(I).
an_atomic(T^A)      :- atom(A), atom_tag(T).
an_atomic(T^(F/N))  :- atom(F), pos(N), atom_tag(T).

a_var(Reg)          :- reg(Reg).
a_var(Perm)         :- perm(Perm).

arg(Arg)            :- a_var(Arg).
arg(Arg)            :- an_atomic(Arg).

var_i(Var)          :- a_var(Var).
var_i([Var])        :- a_var(Var).

arg_i(Arg)          :- var_i(Arg).
arg_i(Arg)          :- an_atomic(Arg).

numreg(Arg)         :- reg(Arg).
numreg(Arg)         :- int(Arg).

numarg_i(Arg)       :- var_i(Arg).
numarg_i(Arg)       :- int(Arg).

var_off([Var])      :- a_var(Var).
var_off([Var+I])    :- a_var(Var), pos(I).

Tag syntax

a_tag(tatm).    /* atom */
a_tag(tint).    /* integer */
a_tag(tneg).    /* negative integer */
a_tag(tpos).    /* nonnegative integer */
a_tag(tstr).    /* structure */
a_tag(tlst).    /* cons cell */
a_tag(tvar).    /* variable */

atom_tag(tatm).

pointer_tag(tstr).
pointer_tag(tlst).
pointer_tag(tvar).


Conditionals syntax

eq_ne(eq).
eq_ne(ne).

cond(eq).       /* Equal */
cond(ne).       /* Not equal */
cond(lts).      /* Signed less than */
cond(les).      /* Signed less than or equal */
cond(gts).      /* Signed greater than */
cond(ges).      /* Signed greater than or equal */

Miscellaneous syntax

hash_type(atomic).
hash_type(structure).

lbl(fail).
lbl(N/A)       :- atom(N), natural(A).
lbl(l(N/A,I))  :- atom(N), natural(A), int(I).

nv_flag(nonvar).
nv_flag(var).
nv_flag('?').

% A list of register numbers:
% (May contain the value 'no' as well)
regs([]).
regs([R|Set]) :- (int(R); R=no), regs(Set).


Utility predicates

ground(X) :- nonvar(X), functor(X, _, N), ground(N, X).

ground(N, _) :- N=:=0.
ground(N, X) :- N=\=0, arg(N,X,A), ground(A), N1 is N-1, ground(N1,X).

int(N)     :- integer(N).
natural(N) :- integer(N), N>=0.
pos(N)     :- integer(N), N>0.


Appendix E

Extended  DCG  notation: 

A  tool  for  applicative  programming  in  Prolog 


1.  Introduction 

This appendix describes a preprocessor that simplifies purely applicative programming in Prolog.
The preprocessor generalizes Prolog's Definite Clause Grammar (DCG) notation to allow programming
with multiple accumulators. It has been an indispensable tool in the development of the Aquarius Prolog
compiler. Its use is transparent in versions of Prolog that conform to the Edinburgh standard. The
preprocessor and a user manual are available by anonymous ftp to arpa.berkeley.edu.

It is desirable to program in a purely applicative style, i.e. within the pure logical subset of Prolog.
In that case a predicate's meaning depends only on its definition, and not on any outside information. This
has two important advantages. First, it greatly simplifies verifying correctness. Simple inspection is often
sufficient. Second, since all information is passed locally, it makes the program more amenable to parallel
execution. However, in practice the number of arguments of predicates written in this style is large, which
makes writing and maintaining them difficult. Two ways of getting around this problem are (1) to
encapsulate information in compound structures which are passed in single arguments, and (2) to use global
instead of local information. Both of these techniques are commonly used in imperative languages such as C,
but neither is a satisfying way to program in Prolog, for the following reasons:

•  Because Prolog is a single-assignment language, modifying encapsulated information requires a
   time-consuming copy of the entire structure. Sophisticated optimizations could make this efficient,
   but compilers implementing them do not yet exist.

•  Using global information destroys the advantages of programming in an applicative style, such as the
   ease of mathematical analysis and the suitability for parallel execution.

A third approach with neither of the above disadvantages is extending Prolog to allow an arbitrary number
of arguments without increasing the size of the source code. The extended Prolog is translated into
standard Prolog by a preprocessor. This report describes an extension to Prolog's Definite Clause Grammar
notation that implements this idea.

2.  Definite  Clause  Grammar  (DCG)  notation 

DCG notation was developed as the result of research in natural language parsing and understanding
[Pereira & Warren 1980]. It allows the specification of a class of attributed unification grammars with
semantic actions. These grammars are strictly more powerful than context-free grammars. Prologs that
conform to the Edinburgh standard [Clocksin & Mellish 1981] provide a built-in preprocessor that
translates clauses written in DCG notation into standard Prolog.

An important Prolog programming technique is the accumulator [Sterling & Shapiro 1986]. The
DCG notation implements a single implicit accumulator. For example, the DCG clause:

term(S) --> factor(A), [+], factor(B), {S is A+B}.

is translated internally into the Prolog clause:

term(S,X1,X4) :- factor(A,X1,X2), X2=[+|X3], factor(B,X3,X4), S is A+B.
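
As a runnable illustration of this translation, the following sketch defines the DCG clause
together with a hand-written copy of its translation. The predicate factor//1 is not defined in
this report; the one-token definition used here, and the name term_explicit/3 for the translated
clause, are assumptions made only so that the example loads and runs in a Prolog with DCG support.

```prolog
% Hypothetical factor//1: a factor is a single integer token.
factor(N) --> [N], { integer(N) }.

% The DCG clause from the text.
term(S) --> factor(A), [+], factor(B), { S is A+B }.

% Hand-written equivalent of the translation, renamed term_explicit/3
% so that both versions can be loaded side by side.
term_explicit(S, X1, X4) :-
    factor(A, X1, X2),
    X2 = [+|X3],
    factor(B, X3, X4),
    S is A+B.
```

Both phrase(term(S), [2,+,3]) and term_explicit(S, [2,+,3], []) bind S to 5. The token list is
threaded through X1, X2, X3 and X4 in the explicit version, which is exactly the chaining that the
DCG notation hides.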

Each predicate is given two additional arguments. Chaining together these arguments implements the
accumulator.

3. Extending the DCG notation

The DCG notation is a concise and clear way to express the use of a single accumulator. However,
in the development of large Prolog programs I have found it useful to carry more than one accumulator. If
written explicitly, each accumulator requires two additional arguments, and these arguments must be
chained together. This requires the invention of many arbitrary variable names, and the chance of
introducing errors is large. Modifying or extending this code, for example to add another accumulator, is
tedious.

One  way  to  solve  this  problem  is  to  extend  the  DCG  notation.  The  extension  described  here  allows 
for  an  unlimited  number  of  named  accumulators,  and  handles  all  the  tedium  of  parameter  passing.  Each 
accumulator  requires  a  single  Prolog  fact  as  its  declaration.  The  bulk  of  the  program  source  does  not 
depend on the number of accumulators, so maintaining and extending it is simplified. For single
accumulators the notation defaults to the standard DCG notation.

Other extensions to the DCG notation have been proposed, for example Extraposition Grammars
[Pereira 1981] and Definite Clause Translation Grammars [Abramson 1984]. The motivation for these
extensions is natural-language analysis, and they are not directly useful as aids in program construction.

4.  An  example 

To illustrate the extended notation, consider the following Prolog predicate which converts infix
expressions containing identifiers, integers, and addition (+) into machine code for a simple stack machine,
and also calculates the size of the code:

expr_code(A+B, S1, S4, C1, C4) :-
        expr_code(A, S1, S2, C1, C2),
        expr_code(B, S2, S3, C2, C3),
        C3=[plus|C4],           /* Explicitly accumulate 'plus' */
        S4 is S3+1.             /* Explicitly add 1 to the size */
expr_code(I, S1, S2, C1, C2) :-
        atomic(I),
        C1=[push(I)|C2],
        S2 is S1+1.

This predicate has two accumulators: the machine code and its size. A sample call is
expr_code(a+3+b, 0, Size, Code, []), which returns the result

Size = 5
Code = [push(a),push(3),plus,push(b),plus]

With DCG notation it is possible to hide the code accumulator, although the size is still calculated
explicitly:

expr_code(A+B, S1, S4) -->
        expr_code(A, S1, S2),
        expr_code(B, S2, S3),
        [plus],                 /* Accumulate 'plus' in a hidden accumulator */
        {S4 is S3+1}.           /* Explicitly add 1 to the size */
expr_code(I, S1, S2) -->
        {atomic(I)},
        [push(I)],
        {S2 is S1+1}.

The extended notation hides both accumulators:

expr_code(A+B) -->>
    expr_code(A),
    expr_code(B),
    [plus]:code,        /* Accumulate 'plus' in the code accumulator */
    [1]:size.           /* Accumulate 1 in the size accumulator */
expr_code(I) -->>
    {atomic(I)},
    [push(I)]:code,
    [1]:size.

The translation of this version is identical to the original definition. The preprocessor needs the following
declarations:

acc_info(code, T, Out, In, (Out=[T|In])).    /* Accumulator declarations */
acc_info(size, T, In, Out, (Out is In+T)).

pred_info(expr_code, 1, [size,code]).        /* Predicate declaration */

For each accumulator this declares the accumulating function, and for each predicate this declares the
name, arity (number of arguments), and accumulators it uses. The order of the In and Out arguments
determines whether accumulation proceeds in the forward direction (see size) or in the reverse direction
(see code). Choosing the proper direction is important if the accumulating function requires some of its
arguments to be instantiated.

5. Concluding remarks

An extension to Prolog’s DCG notation that implements an unlimited number of named accumulators
was developed to simplify purely applicative Prolog programming. Comments and suggestions for
improvements are welcome.

Extended DCG notation:

A tool for applicative programming in Prolog

User Manual


1.  Introduction 

This manual describes a preprocessor for Prolog that adds an arbitrary number of arguments to a
predicate without increasing the size of the source code. The hidden arguments are of two kinds:

(1) Accumulators, useful for results that are calculated incrementally in many predicates. An accumulator
expands into two additional arguments per predicate.

(2) Passed arguments, used to pass global information to many predicates. A passed argument expands
into a single additional argument per predicate.

The preprocessor has been tested under C-Prolog and Quintus Prolog. It is being used by the author in
program development, and is believed to be relatively bug-free. However, it is still being refined and
extended. The most recent version is available by anonymous ftp to aipa.berkeley.edu or by contacting the
author. Please let me know if you find any bugs. Comments and suggestions for improvements are
welcome.

2.  Using  the  preprocessor 

The preprocessor is implemented in the file accumulator.pl. It must be consulted or compiled
before the programs that use it. In Prologs that conform to the Edinburgh standard, such as C-Prolog or
Quintus Prolog, the user-defined predicate term_expansion/2 is called when consulting or compiling
each clause that is read. With this hook the use of the preprocessor is transparent.

Clauses to be expanded are of the form (Head-->>Body) where Head and Body are the
head and body of the clause. The head is always expanded with all of its hidden arguments. Table 1
summarizes the expansion rules for body goals. In the table, Goal denotes any goal in a clause body, Acc
denotes an accumulator, Pass denotes a passed argument, and Arg denotes either an accumulator or a
passed argument. Hidden arguments of body goals that are not in the head have default values which can
be overridden. For compatibility with DCG notation the accumulator dcg is available by default.
If-then-else is not handled in this version.

The preprocessor assumes the existence of a database of information about the hidden parameters
and the predicates to be expanded. Three relations are recognized: a declaration for each predicate, each
accumulator, and each passed argument. These relations can be put at the beginning of each file (in which
case their scope is the file) or stored in a separate file that is consulted first (in which case their scope is the
whole program).

Table 1: Expansion rules for the preprocessor

Body goal           Action

{Goal}              Don't expand any hidden arguments of Goal.

Goal                Expand all of the hidden parameters of Goal that are also in the
                    head. Those hidden parameters not in the head are given default
                    values.

Goal:L              If Goal has no hidden arguments then force the expansion of all
                    arguments in L in the order given. If Goal has hidden arguments
                    then expand all of them, using the contents of L to override
                    the expansion. L is either a term of the form Acc,
                    Acc(Left,Right), Pass, Pass(Value), or a list of such
                    terms. When present, the arguments Left, Right, and Value
                    override the default values of arguments not in the head.

List:Acc            Accumulate a list of terms in the accumulator Acc.

List                Accumulate a list of terms in the accumulator dcg.

X/Arg               Unify X with the left term for the accumulator or passed argument
                    Arg.

Acc/X               Unify X with the right term for accumulator Acc.

X/Acc/Y             Unify X with the left and Y with the right term for the
                    accumulator Acc.

insert(X,Y):Acc     Insert the arguments X and Y into the chain implementing the
                    accumulator Acc. This is useful when the value of the accumulator
                    changes radically because X and Y may be the arguments of an
                    arbitrary relation.

insert(X,Y)         Insert the arguments X and Y into the chain implementing the
                    accumulator dcg. This inserts the difference list X-Y into the
                    accumulated list.

A short example gives a flavor of what the preprocessor does:

% Declare the accumulator 'castor':
acc_info(castor, _, _, _, true).

% Declare the passed argument 'pollux':
pass_info(pollux).

% Declare three predicates using these hidden arguments:
pred_info(p, 1, [castor,pollux]).
pred_info(q, 1, [castor,pollux]).
pred_info(r, 1, [castor,pollux]).

% The program:
p(X) -->> Y is X+1, q(Y), r(Y).

This example declares one accumulator, one passed argument, and three predicates using them. The
program consists of a single clause. The preprocessor is used as follows (bold-face denotes user input):

% cprolog
C-Prolog version 1.5
| ?- ['accumulator.pl'].
accumulator.pl consulted 9780 bytes 1.7 sec.
yes
| ?- ['example.pl'].
example.pl consulted 668 bytes 0.25 sec.
yes
| ?-

Now the predicate p(X) has been expanded. We can see what it looks like with the listing
command:

| ?- listing(p).
p(X, S1, S3, P) :- Y is X+1, q(Y, S1, S2, P), r(Y, S2, S3, P).

(Variable names have been changed for clarity.) The arguments S1, S2, and S3, which implement the
accumulator castor, are chained together. The argument P implements the passed argument pollux. It is
added as an extra argument to each predicate.

In  object-oriented  terminology  the  declarations  of  hidden  parameters  correspond  to  classes  with  a 
single  method  defined  for  each.  Declarations  of  predicates  specify  the  inheritance  of  the  predicate  from 
multiple  classes,  namely  each  hidden  parameter. 

3.  Declarations 

3.1.  Declaration  of  the  predicates 

Predicates are declared with facts of the following form:

pred_info(Name, Arity, List)

The predicate Name/Arity has the hidden parameters given in List. The parameters are added in the
order given by List and their names must be atoms.

3.2.  Declaration  of  the  accumulators 

Accumulators are declared with facts in one of two forms. The short form is:

acc_info(Acc, Term, Left, Right, Joiner)

The long form is:

acc_info(Acc, Term, Left, Right, Joiner, LStart, RStart)

In most cases the short form gives sufficient information. It declares the accumulator Acc, which must be
an atom, along with the accumulating function, Joiner, and its arguments Term, the term to be
accumulated, and Left and Right, the variables used in chaining.

The long form of acc_info is useful in more complex programs. It contains two additional
arguments, LStart and RStart, that are used to give default starting values for an accumulator occurring
in a body goal that does not occur in the head. The starting values are given to the unused accumulator to
ensure that it will execute correctly even though its value is not used. Care is needed to give correct values
for LStart and RStart. For DCG-like list accumulation both may remain unbound.

Two conventions are used for the two variables used in chaining depending on which direction the
accumulation is done. For forward accumulation, Left is the input and Right is the output. For
reverse accumulation, Right is the input and Left is the output.

To see how these declarations work, consider the following program:

% Example illustrating the difference between
% forward and reverse accumulation:

% Declare the accumulators:
acc_info(fwd, T, In, Out, Out=[T|In]).    % Forward accumulator.
acc_info(rev, T, Out, In, Out=[T|In]).    % Reverse accumulator.

% Declare the predicates using them:
pred_info(flist, 1, [fwd]).
pred_info(rlist, 1, [rev]).

% flist(N, [], List) creates the list [1, 2, ..., N]
flist(0) -->> [].
flist(N) -->> N>0, [N]:fwd, N1 is N-1, flist(N1).

% rlist(N, List, []) creates the list [N, ..., 2, 1]
rlist(0) -->> [].
rlist(N) -->> N>0, [N]:rev, N1 is N-1, rlist(N1).

This defines two accumulators fwd and rev that both accumulate lists, but in different directions. The
joiner of both accumulators is the unification Out=[T|In], which adds T to the head of the list In
and creates the list Out. In accumulator fwd the input In is the left argument and the output Out is
the right argument. This builds the list in ascending order. Switching the arguments, as in the accumulator
rev, builds the list in reverse. A sample execution gives these results:


| ?- flist(10, [], List).
List = [1,2,3,4,5,6,7,8,9,10]
yes
| ?- rlist(10, List, []).
List = [10,9,8,7,6,5,4,3,2,1]
yes

If the joining function is not reversible then the accumulator can only be used in one direction. For
example, the accumulator add with declaration:

acc_info(add, I, In, Out, Out is I+In).

It can only be used as a forward accumulator. Attempting to use it in reverse results in an error because the
argument In of the joiner is uninstantiated. The reason for this is that the predicate is/2 is not pure
logic: it requires the expression in its right-hand side to be ground.
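
The two chaining directions can also be pictured outside Prolog. The following C sketch is only an analogy under stated assumptions: the names cell, cons, flist, and rlist are illustrative and are not part of the preprocessor, and C builds the reverse list on return from the recursion, whereas Prolog would bind an as yet unbound tail variable. The function cons plays the role of the joiner Out=[T|In].

```c
#include <stdlib.h>

/* Illustrative list cell; not part of the preprocessor. */
struct cell {
    int head;
    struct cell *tail;
};

/* The shared joiner Out = [T|In]: put t in front of the list in. */
struct cell *cons(int t, struct cell *in)
{
    struct cell *out = malloc(sizeof *out);
    if (out == NULL)
        abort();
    out->head = t;
    out->tail = in;
    return out;
}

/* Forward direction: the accumulated value is passed into each step,
   and the value left after the last step is the result. */
struct cell *flist(int n, struct cell *in)
{
    while (n > 0) {
        in = cons(n, in);   /* N is joined first, 1 last */
        n--;
    }
    return in;
}

/* Reverse direction: the result of each step is built from the
   result of the steps that follow it. */
struct cell *rlist(int n, struct cell *in)
{
    if (n <= 0)
        return in;
    return cons(n, rlist(n - 1, in));
}
```

Called with an empty list (NULL), flist(3, NULL) yields the cells 1, 2, 3 and rlist(3, NULL) yields 3, 2, 1, the same ordering as in the sample execution of flist and rlist.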

3.3. Declaration of the passed arguments

Passed arguments are declared as facts in one of two forms. The short form is:

pass_info(Pass)

The long form is:

pass_info(Pass, PStart)

In most cases the short form is sufficient. It declares a passed argument Pass, that must be an atom. The
long form also contains the starting value PStart that is used to give a default value for a passed
argument in a body goal that does not occur in the head. Most of the time this situation does not occur.

4.  Tips  and  techniques 

Usually there will be one clause of pred_info for each predicate in the program. If the program
becomes very large, the number of clauses of pred_info grows accordingly and can become difficult
to keep consistent. In that case it is useful to remember that a single pred_info clause can summarize
many facts. For example, the following declaration:

pred_info(_, _, List).

gives all predicates the hidden parameters in List. This keeps programming simple regardless of the
number of hidden parameters.

Appendix F

Source code of the C and Prolog benchmarks

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

/* C version of tak benchmark */

#include <stdio.h>

int tak(x,y,z)
int x, y, z;
{
  int a1, a2, a3;
  if (x <= y) return z;
  a1 = tak(x-1,y,z);
  a2 = tak(y-1,z,x);
  a3 = tak(z-1,x,y);
  return tak(a1,a2,a3);
}

main()
{
  printf("%d\n", tak(24, 16, 8));
}


/* Prolog version of tak benchmark */

main :- tak(24,16,8,X), write(X), nl.

tak(X,Y,Z,A) :- X =< Y, Z = A.
tak(X,Y,Z,A) :- X > Y,
        X1 is X - 1, tak(X1,Y,Z,A1),
        Y1 is Y - 1, tak(Y1,Z,X,A2),
        Z1 is Z - 1, tak(Z1,X,Y,A3),
        tak(A1,A2,A3,A).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

/* C version of fib benchmark */

#include <stdio.h>

int fib(x)
int x;
{
  if (x <= 1) return 1;
  return (fib(x-1)+fib(x-2));
}

main()
{
  printf("%d\n", fib(30));
}


/* Prolog version of fib benchmark */

main :- fib(30,N), write(N), nl.

fib(N,F) :- N =< 1, F = 1.
fib(N,F) :- N > 1,
        N1 is N - 1, fib(N1,F1),
        N2 is N - 2, fib(N2,F2),
        F is F1 + F2.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 

/* C version of hanoi benchmark */

#include <stdio.h>

han(n,a,b,c)
{
  int n1;
  if (n<=0) return;
  n1 = n-1;
  han(n1,a,c,b);
  han(n1,c,b,a);
}

main()
{
  han(20,1,2,3);
}


/* Prolog version of hanoi benchmark */

main :- han(20,1,2,3).

han(N,_,_,_) :- N=<0.
han(N,A,B,C) :- N>0,
        N1 is N - 1,
        han(N1,A,C,B),
        han(N1,C,B,A).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 


/* C version of quicksort benchmark */

#include <stdio.h>
int ilist[50] =
{27,74,17,33,94,18,46,83,65, 2,
 32,53,28,85,99,47,28,82, 6,11,
 55,29,39,81,90,37,10, 0,66,51,
  7,21,85,27,31,63,75, 4,95,99,
 11,28,61,74,18,92,40,53,59, 8};

int list[50];

qsort(l,r)
int l, r;
{
  int v, t, i, j;
  if (l<r) {
    v=list[l]; i=l; j=r+1;
    do {
      do i++; while (list[i]<v);
      do j--; while (list[j]>v);
      t=list[j]; list[j]=list[i]; list[i]=t;
    } while (j>i);
    list[i]=list[j]; list[j]=list[l]; list[l]=t;
    qsort(l,j-1);
    qsort(j+1,r);
  }
}

main()
{
  int i, j;
  for(j=0; j<10000; j++) {
    for(i=0; i<50; i++) list[i]=ilist[i];
    qsort(0,49);
  }
  for(i=0; i<50; i++) printf("%d ",list[i]);
  printf("\n");
}


/* Prolog version of quicksort benchmark */

main :- range(1,I,9999), qsort(_), fail.
main :- qsort(S), write(S), nl.

range(L,L,H).
range(L,I,H) :- L<H, L1 is L+1, range(L1,I,H).

qsort(S) :- qsort([27,74,17,33,94,18,46,83,65, 2,
                   32,53,28,85,99,47,28,82, 6,11,
                   55,29,39,81,90,37,10, 0,66,51,
                    7,21,85,27,31,63,75, 4,95,99,
                   11,28,61,74,18,92,40,53,59, 8],S,[]).

qsort([X|L],R,R0) :-
        partition(L,X,L1,L2),
        qsort(L2,R1,R0),
        qsort(L1,R,[X|R1]).
qsort([],R,R).

partition([Y|L],X,[Y|L1],L2) :- Y=<X, partition(L,X,L1,L2).
partition([Y|L],X,L1,[Y|L2]) :- Y>X, partition(L,X,L1,L2).
partition([],_,[],[]).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Appendix G

Source code of the Aquarius Prolog compiler


Due to the size of the source code, it has not been included here. The complete Aquarius system
including source code will be distributed in Spring 1991. The source code of the compiler may also be
obtained from the author.


Files in the compiler

File                      Description

accumulator.pl            Extended DCG preprocessor
accumulator_cleanup.pl    Cleanup file needed for preprocessor
analyze.pl                Dataflow analyzer
clause_code.pl            Clause compiler
conditions.pl             Formula manipulation utilities
compiler.pl               Top level of compiler, includes type enrichment
expression.pl             Compile arithmetic expressions
factor.pl                 Factoring transformation
flatten.pl                Flattening transformation
inline.pl                 In-line replacement
mutex.pl                  Mutual exclusion and implication of formulas
peephole.pl               BAM transformations (except synonym)
preamble.pl               Part of standard form transformation
proc_code.pl              Predicate compiler
regalloc.pl               Register allocator
segment.pl                Head-body segmentation and goal reordering
selection.pl              Determinism extraction
standard.pl               Standard form transformation
synonym.pl                Synonym optimization
tables.pl                 Compilation tables
testset.pl                List of test sets
transform_cut.pl          Cut transformation
unify.pl                  Unification compiler
utility.pl                Utility predicates

ABSTRACT

Most Prolog machines have been based on specialized architectures.
Our goal is to start with a general purpose architecture and
determine a minimal set of extensions for high performance Prolog
execution. We have developed both the architecture and optimizing
compiler simultaneously, drawing on results of previous implementations.
We find that most Prolog specific operations can be done
satisfactorily in software; however, there is a crucial set of features
that the architecture must support to achieve the best Prolog
performance. The emphasis of this paper is on our architecture and
instruction set. The costs and benefits of the special architectural
features and instructions are analyzed. Simulated performance
results are presented and indicate a peak compiled Prolog
performance of 3.68 million logical inferences per second.


1.  Introduction 

Logic programming in general and Prolog [1] in particular
have become popular for rapid software prototyping, natural
language translation, and expert system programming. Prolog’s use
of dynamic typing, backtracking, and unification places heavy
computational demands on general purpose computers. In an attempt to
achieve ever higher performance, several special purpose
architectures have been proposed and built. Early Prolog architectures [2]
were microcoded interpreters. Because no compilation was done,
performance was disappointing. Higher performance processors [3-6]
have since been based on the Warren Abstract Machine (WAM)
[7]. Their instruction sets were derived from the WAM to support
execution of Prolog programs. These processors are special purpose,
microcoded engines which depend on parallel execution of
operations within each relatively coarse-grained instruction for high
performance. Initial designs implemented only the instructions that
supported the WAM and depended on a host processor for
non-WAM computations. To support Prolog built-ins (primitive Prolog
operations provided by the system) and system I/O, newer designs
incorporate general purpose instructions to minimize dependence on
a host. Alternatively, the use of a simple, non-WAM instruction set
better supports compiler optimization. Several such special purpose
reduced instruction set architectures have been proposed for logic
programming [8-11]. These architectures include primitives which
support the use of tagged data, pointer dereference, and multi-way
branches. Our hypothesis is that providing support for both compiler
optimization and low-level operations can best be accomplished by
extending a simple general purpose architecture to support Prolog
without compromising the general purpose performance.

The performance improvements of recent general purpose
architectures over older architectures can be traced to research in
which both the compiler and architecture were developed together
[12-14]. Architectural features that cannot be used by the compiler
or which cannot demonstrate performance improvement are not
included. Likewise, architectural features are added which support
often used primitive operations. We have adopted this approach
from the beginning of our project.

It has been conjectured that commercial special purpose
symbolic processing architectures are doomed because they are not
commodity items, and consequently, economics prevent them from
staying on the leading edge of implementation technology. However, if
the architectural features necessary to improve symbolic
performance are modest and do not interfere with the general purpose
architecture, then as more chip area becomes available, future
implementations of general purpose processors can deliver high
performance symbolic computing in a standard product. We hope that our
work is a step towards this result.

This paper presents the design of a processor based on the
Berkeley Abstract Machine (BAM) architecture and motivates its
design with the results of our preliminary studies. We also present a
brief discussion of the optimizing compiler, a cost/benefit analysis of
the architectural features, and the simulated performance.
Familiarity with the WAM is helpful. Section 2 describes the processor
architecture and hardware implementation. Section 3 presents the
instruction set along with the results of our studies which motivated
instruction selection. The compilation of Prolog programs is
described in section 4, and in section 5 we present a cost/benefit
analysis of the special features and instructions. Section 6 gives
performance results. The final section concludes with a summary of
our results.

2. Processor Architecture and Implementation

The BAM processor is a general purpose, single chip, pipelined
processor with extensions to support Prolog execution (Figure 1).
Both data and instruction words are 32 bits, and most instructions
execute in a single cycle. The main features for Prolog are tag
manipulation (integrated into arithmetic and the memory system), a
double-word data port to memory, special branch on tag support, and
several instructions to support our execution model for Prolog.

The architecture is presented in detail along with our
motivations in the subsections below. Retaining a core general purpose
architecture imposes constraints on the symbolic extensions. For
example, the processor should be able to handle tagged data items as
single entities, without special treatment for the tags. We discuss the
ramifications of this on the word format and the virtual memory
system. Then we present the architecture's register structure and
memory interface. Finally, we present some details of the
implementation such as the pipeline structure and our mechanism for
multiple-cycle instructions.

Figure 1: Block Diagram of the BAM Processor

2.1.  Word  Format 

Prolog does not require the user to specify the type of a data
item. This requires that run time type checking be implemented by
adding a tag to each data item to encode the type of that item. Many
Prolog processors handle the tag and value fields separately. This
approach does not satisfy our goal of integrating tagging into a
general purpose architecture. Instead, we use a standard 32-bit word
length and place the tag in the most significant four bits of the word.
Arithmetic computations and addresses, however, use the entire
32-bit word, so general purpose computations are not affected by
Prolog’s use of tags. Tag values fixed by the hardware are those for
non-negative integers (0000) and negative integers (1111). This
selection of tags for integers is a common technique used by Lisp
implementations on general purpose machines [15]. We have also
fixed the tag value for variable pointers (tvar = 0001) to increase the
number of bits available for branch displacements in several Prolog
specific instructions. All other tag values are software defined. Our
Prolog implementation uses tags similar to those of the WAM.
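
As a rough illustration of this word format, the C sketch below extracts the four-bit tag and tests for the two hardware-fixed integer tags. The helper names (tag_of, is_integer, make_var, add_words) are hypothetical and are not taken from the BAM design; the sketch assumes only what the text states, namely a 32-bit word with the tag in the most significant four bits. It shows why ordinary two's-complement integers between -2^28 and 2^28-1 already carry a valid integer tag, so that plain 32-bit addition can operate on them unchanged.

```c
#include <stdint.h>

#define TAG_SHIFT      28u    /* tag occupies bits 31..28 */
#define TAG_NONNEG_INT 0x0u   /* fixed by hardware: 0000 */
#define TAG_VAR        0x1u   /* fixed by hardware: tvar = 0001 */
#define TAG_NEG_INT    0xFu   /* fixed by hardware: 1111 */

/* Return the 4-bit tag held in the top bits of a word. */
unsigned tag_of(uint32_t word)
{
    return (unsigned)(word >> TAG_SHIFT);
}

/* A word is an integer if its tag is 0000 or 1111, that is, if it
   is a sign-extended two's-complement value of at most 29 bits. */
int is_integer(uint32_t word)
{
    unsigned t = tag_of(word);
    return t == TAG_NONNEG_INT || t == TAG_NEG_INT;
}

/* Hypothetical helper: build a variable pointer from a 28-bit value. */
uint32_t make_var(uint32_t value)
{
    return ((uint32_t)TAG_VAR << TAG_SHIFT) | (value & 0x0FFFFFFFu);
}

/* Ordinary 32-bit addition. The result keeps a valid integer tag
   only while it stays in range; overflow checking is not modeled. */
uint32_t add_words(uint32_t a, uint32_t b)
{
    return a + b;
}
```

For instance, the words for 7 and -12 both pass is_integer, and add_words on them gives the word for -5, which again has the tag 1111.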

2.2.  Segmented  Virtual  Addresses 

One consequence of using both the tag and value as an address
is that each data type is mapped into its own area of virtual memory.
For Prolog's execution model one wishes to place several data types
in the same stack or heap. One possible solution is to mask (zero)
the tag bits of the address before using it to access memory. This
solution is not satisfactory when applied to applications not using
tags (for example, C programs). To avoid this difficulty, we have
introduced a segment table which maps the most significant six bits
of an address to a twelve-bit value (Figure 2). An address before
mapping is referred to as a short virtual address (SVA), and the
38-bit address resulting from the mapping is referred to as a long virtual
address (LVA). This memory segmentation scheme is similar to the
segmentation used in the 801 processor [16]. The 801 uses
segmentation to extend the virtual address space; however, our primary
motivation for using segmentation is to allow multiple data types to
be mapped to the same LVA segment. Mapping two bits in addition
to the tag bits allows the use of several memory areas for a given
data type, each area using a different mapping. At one extreme all
data types can be mapped to the same LVA segment (this is
equivalent to masking the most significant six address bits). At the
other extreme, all SVA segments can be mapped to distinct LVA
segments. In our current implementation of Prolog, variable, list,
and structure pointers are mapped to the same LVA segment,
whereas the environment/choice point stack, the trail stack, and the
symbol table are mapped to separate segments.

Figure 2: Segmentation of Virtual Address Space
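
The address mapping can be sketched in C as follows. This is a minimal model under stated assumptions, not the hardware's implementation: the names seg_table and sva_to_lva are hypothetical, and the sketch assumes that the twelve-bit table entry simply replaces the six most significant bits of the 32-bit SVA, giving 12 + 26 = 38 bits.

```c
#include <stdint.h>

#define SEG_BITS 6u                        /* SVA bits that select a segment */
#define OFF_BITS 26u                       /* remaining 32 - 6 offset bits */
#define OFF_MASK ((1u << OFF_BITS) - 1u)

/* Hypothetical segment table: 64 entries of 12 bits each. */
uint16_t seg_table[1u << SEG_BITS];

/* Map a 32-bit short virtual address to a 38-bit long virtual
   address by replacing its top six bits with a table entry. */
uint64_t sva_to_lva(uint32_t sva)
{
    uint32_t index   = sva >> OFF_BITS;            /* 0..63 */
    uint64_t segment = seg_table[index] & 0xFFFu;  /* 12-bit value */
    return (segment << OFF_BITS) | (sva & OFF_MASK);
}
```

If two table entries that belong to different tags hold the same segment number, pointers with those tags and equal offsets map to the same LVA, which is how several data types can share one stack or heap.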

Another use of segmentation is for sharing data in a
multiprocessor system. In this case the 38-bit LVA is used as the global
virtual address and sharing of data by cooperating processes is done at
the segment level.

2.3. Memory Interface

The high memory bandwidth requirement of Prolog dictates
separate instruction and data buses (Figure 1). In addition, we have
expanded the data bus to double-word width. A double-word data
bus is motivated by Carlson's study [17] of the architectural
requirements of high performance Prolog processors. Carlson compiled
Prolog programs into basic register transfer level operations and then
compacted them into more complex instructions while enforcing
microarchitectural constraints. His results show that the best
performance/cost tradeoff occurs when the architecture provides a
double-word port to data memory.

A double-word memory port improves the performance of term
creation and speeds block transfers to and from environments and
choice points. Some previous Prolog processors support fast choice
point creation and restoration through the use of specialized buffers
or shadow registers [3,9]. Such hardware solutions are costly and do
not fit our goal of maintaining a general purpose architecture.
Instead, we rely on double-word memory operations and on compiler
optimization to minimize shallow backtracking [18].

Our processor design is tightly coupled with the cache design.
We decided against on-chip caches since, in our case, it is more
appropriate to use processor chip area for architectural features and
use fast dense static RAM chips for large caches. To speed cache
accesses, however, protection violation and consistency checks and
address tag comparison are done on-chip. More details about the
cache interface are given in [19].

2.4. Basic Architecture

All programmer visible processor registers are accessed as two
sets of 32 registers: the general purpose register set and the special
register set. The general purpose registers are used for procedure
argument passing, temporary storage, and as stack pointers. The
only general purpose register with a preassigned use is the
continuation pointer (r31). This register is implicitly set to the return address
by the call instruction. All other uses of the general purpose
registers are defined by software convention.

The special registers provide access to the processor status
word (PSW), program counter (PC), partial product/quotient register
(PQ), segment mapping table, cache interface configuration registers,
and a set of fifteen extra registers (s0-s14).


2.5. Implementation Details

The execution pipeline consists of five stages (Figure 3). All
instructions which modify registers or memory do so in the last
pipeline stage. Bypassing forwards available results of calculations to
instructions following in the pipeline. Hardware interlocks are
provided for both load and store delays. If data from a load instruction
is used by the next instruction, then the next instruction is delayed by
a cycle. Also, memory instructions immediately following a store
are delayed by a cycle.

Figure 3: BAM Processor Execution Pipeline
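
The two interlock rules can be made concrete with a small C sketch that counts the delay cycles in a straight-line instruction sequence. The instruction representation (enum kind, struct instr) and the function stall_cycles are hypothetical and chosen only for illustration; the sketch models just the two rules stated above and ignores multi-cycle instructions, branches, and traps.

```c
#include <stddef.h>

/* Hypothetical instruction classes for this sketch. */
enum kind { ALU, LOAD, STORE };

struct instr {
    enum kind kind;
    int dest;    /* register written, or -1 if none */
    int src1;    /* registers read, or -1 if unused */
    int src2;
};

/* Count the interlock delay cycles for a straight-line sequence. */
size_t stall_cycles(const struct instr *code, size_t n)
{
    size_t stalls = 0;
    size_t i;

    for (i = 1; i < n; i++) {
        const struct instr *prev = &code[i - 1];
        const struct instr *cur = &code[i];

        if (prev->kind == LOAD && prev->dest >= 0 &&
            (cur->src1 == prev->dest || cur->src2 == prev->dest)) {
            /* Load delay: the next instruction uses the loaded data. */
            stalls++;
        } else if (prev->kind == STORE &&
                   (cur->kind == LOAD || cur->kind == STORE)) {
            /* Store delay: a memory instruction follows a store. */
            stalls++;
        }
    }
    return stalls;
}
```

A compiler's instruction scheduler could use a count of this kind to compare two orderings of the same instructions, preferring the one that separates a load from its first use and a store from the next memory access.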

All instructions are 32 bits with a 6-bit opcode and fixed source
register format. Instruction execution is controlled by an opcode
pipeline which operates in parallel with the execution pipeline. Each
stage of the opcode pipe decodes the opcode associated with that
stage of the execution pipeline. Multi-cycle instructions and
conditional instructions are implemented using “internal opcodes” [20].
The internal opcodes of multi-cycle instructions are fetched from a
PLA and inserted into the opcode pipeline. When an internal opcode
is inserted, no instruction is fetched during that cycle. Thus a single
external opcode can invoke a sequence of internal opcodes to
provide for often used complex operations (for example, pointer
dereferencing). Internal opcode insertion is also used for atomic
synchronization operations, for pipeline interlock delays, and for trap
and interrupt handling. Conditional execution is implemented by
conditionally replacing an opcode in the opcode pipe with an internal
opcode. Our design uses 55 external opcodes and 24 internal
opcodes; of the internal opcodes, nine are related to traps (trap, rft),
13 implement multi-cycle instructions (dref, stx, std, pushd, las,
jmpr), and two implement conditional operation instructions (uni,
pusht).

“Fast tag logic” is used to implement single-cycle
tag-compare-and-branch instructions. The fast tag logic consists of an
extra register file which duplicates the tag portion of the general
purpose register file and special tag comparison logic which allows
quick tag comparison and branch. Previous Prolog processors [3]
have also duplicated tag bits to accelerate branching on tag value.

The general purpose register file has two read ports (one
single-word and one double-word) and two write ports (both
single-word). This port structure provides the bandwidth required by
single-cycle double-word memory accesses without greatly
increasing the complexity of the register file design.

3.  Instruction  Set 

In this section we present the BAM instruction set. The
instructions are divided into three groups: general purpose, Prolog
inspired general purpose, and Prolog specific. The general purpose
instructions are those which can be found in typical processors. The
Prolog inspired instructions are those which are not often present in
general purpose processors, but which can still be used for general
computation. The remaining instructions are tailored specifically to
the requirements of Prolog execution.

The general purpose instructions are summarized in Table 1. It
is important to point out that all arithmetic and logic operations
operate on the full 32-bit word. Also, conditional branches consist
of separate compare and branch instructions. Compare instructions
set or clear the TF (true-false) condition code bit, and the branch
instructions take the branch when TF is set. Branches, jumps, and
calls are delayed by one instruction. The instruction in a branch
delay slot can always be executed (bt), annulled (turned into a nop)
if the branch is taken (btat), or annulled if the branch is not taken
(btan). Both directions of annulling are included because Prolog
often favors annulling when the branch is taken (for example,
branching out of straight-line code to the unification failure routine),
whereas conditional branches to the top of a loop (common in
procedural languages) favor annulling when the branch is not taken.
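
The three delay-slot behaviors described above can be summarized as a small decision function. The Python below is a sketch for illustration; the function name is hypothetical, and the mnemonics `bt`, `btat`, and `btan` are used as they are best recovered from Table 1 (annul when taken and annul when not taken, respectively).

```python
def delay_slot_executes(mnemonic, branch_taken):
    """Return True if the instruction in the branch delay slot executes.

    bt   : delay slot always executes
    btat : delay slot is annulled (becomes a nop) when the branch is taken
    btan : delay slot is annulled when the branch is not taken
    """
    if mnemonic == "bt":
        return True
    if mnemonic == "btat":
        return not branch_taken
    if mnemonic == "btan":
        return branch_taken
    raise ValueError("unknown branch mnemonic: " + mnemonic)
```

A branch to a failure routine would typically use the annul-when-taken form, so that the slot holds the next instruction of the straight-line path; a loop-closing branch would use the annul-when-not-taken form, so that the slot holds the first instruction of the loop body.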

The remainder of this section motivates and presents our
extensions to the general purpose instruction set. A major influence on the
design of these extensions was the simultaneous development of an
optimizing Prolog compiler. The abstract machine used by the
compiler was initially designed using a top-down approach [21]. We
assumed a set of data structures similar to those used by the WAM.
Knowledge of possible compiler optimizations was applied to the
semantics of Prolog to decompose Prolog's general operations into
their components. These components, the abstract instruction set,
are the instructions and addressing modes required to compile Prolog
operations into efficient code. Efficient translation of abstract
machine instructions into the architectural instruction set was a
prime influence in the first pass of the instruction set design.

In addition to our studies of abstract instruction sets, we
investigated the microarchitectural requirements for high performance
Prolog [17] and gathered execution statistics for the VLSI-PLM, a
microcoded implementation of the WAM [4]. These investigations
pointed out the microarchitectural features that would give the
greatest performance gains and the Prolog operations that most need
instruction set support.


3.1. Prolog Inspired General Purpose Instructions

Prolog inspired general purpose instructions are those
instructions which support Prolog but which also may be useful in the
implementation of other languages (Table 2). These instructions
include load and store of immediates, single-cycle double-word load
and store, and push and pop memory operations.

Immediates can be loaded, stored, or used in a comparison (ldi,
sti, stdi, cmpi). The immediates are tagged and are created by
sign-extending a 12 or 17-bit immediate and replacing the four most
significant bits with an immediate tag. Load immediate (ldi) is used
for creating integers and atoms. Store immediate (sti) is an
optimization of a ldi, st sequence and is used to bind an atom with a
variable that is known at compile time to be unbound.
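
The construction of a tagged immediate (sign-extend, then overwrite the top four bits with the tag) can be sketched as follows. This is an illustrative model only; the function names are hypothetical, and the tag values used in any call are placeholders except where the text gives them (integer tags 0000 and 1111, unbound variable tag 0001).

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value, bits):
    """Sign-extend the low `bits` bits of value to a 32-bit word."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value & MASK32


def tagged_immediate(imm, bits, tag):
    """Build a 32-bit tagged word from a 12- or 17-bit immediate.

    The immediate is sign-extended, then its four most significant
    bits are replaced with the four-bit tag.
    """
    if bits not in (12, 17):
        raise ValueError("immediate must be 12 or 17 bits wide")
    if not 0 <= tag <= 0xF:
        raise ValueError("tag must fit in four bits")
    word = sign_extend(imm, bits)
    return (word & 0x0FFFFFFF) | (tag << 28)
```

Note that a negative immediate combined with the tag 1111 yields the ordinary two's complement word, which is consistent with the use of 0000 and 1111 as the integer tags.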

Double-word memory operations (ldd, std, stdc, pushd,
pushdc) are motivated by Prolog's large memory bandwidth
requirements. A double-word store or push is single-cycle only if the
source registers form a consecutive, even/odd register pair, because
only three registers, two of which must be adjacent, can be read per
cycle from the register file. Although non-consecutive double store
and push (std, pushd) are two-cycle instructions, this is offset by the

Instruction             Operands            Action                                                          Cycles

                        r(i), disp16, r(k)  r(k) <- M[r(i)+disp16] (not distinguishable to cache)
                                            r(k) <- M[r(i)+r(j)]
                        r(i), r(k), disp16  M[r(k)+disp16] <- r(i) (not distinguishable to cache)
                        r(i), r(k), r(l)    M[r(k)+r(l)] <- r(i)
                        r(i), disp16, r(k)  r(k) <- M[r(i)+disp16]; M[r(i
add, sub, and, or, xor  r(i), r(j), r(k)    r(k) <- r(i) op r(j)
add32, sub32            r(i), r(j), r(k)    r(k) <- r(i) op r(j) (trap on signed 32-bit overflow)          1
addi, andi, ori, xori   r(i), imm16, r(k)   r(k) <- r(i) op imm16                                           1
sll, sra, srl           r(i), r(j), r(k)    r(k) <- r(i) op r(j)<4:0>                                       1
slli, srai, srli        r(i), imm5, r(k)    r(k) <- r(i) op imm5<4:0>                                       1
divs, mpys              r(i), r(j), r(k)    (r(k), PQ, TF) <- op(r(i), r(j), PQ, TF)                        1
cmp                     cond, r(i), r(j)    TF <- (r(i) cond r(j))                                          1
bt                      addr26              if (TF) PC<25:0> <- addr26                                      1
btan                    addr26              if (TF) PC<25:0> <- addr26; else annul next instruction         1
btat                    addr26              if (TF) { PC<25:0> <- addr26; annul next instruction }          1
jmp                     addr26              PC<25:0> <- addr26                                              1
jmpr                    r(i), disp16        PC <- r(i) + disp16                                             2
call                    addr26              r(31) <- PC+1; PC<25:0> <- addr26                               1
rd                      s(i), r(k)          r(k) <- s(i)                                                    1
wr                      r(i), s(k)          s(k) <- r(i)                                                    1
trap                    imm5                save PCs and PSW; set supervisor bit; PC <- 2*(32+imm5<4:0>)    6
rft                                         restore saved PSW; fetch at saved PCs                           4

Table 1

General Purpose Instructions

Tables 1-3 summarize the BAM processor instruction set, divided into three groups: general purpose, Prolog-inspired
general purpose, and Prolog specific. The first two columns give the instruction mnemonic and operands. The third column
gives the instruction's register transfer description. R(i) denotes general purpose register i; s(i) denotes special register i;
dispn is a sign-extended n-bit displacement; immn is a sign-extended n-bit immediate; addr26 is a 26-bit segment offset;
off1_8 and off2_8 are zero-extended 8-bit displacements; tag is a four-bit immediate tag value; and cond is one of twenty
comparison conditions. M[x] is the memory location at address x. Tag^value specifies the tag insertion operation. Tvar
represents the value of the unbound variable tag (0001). Cycle counts assume no pipeline stalls due to load or store delays.
All branch and jump instructions are delayed, and the following instruction is executed unless it is annulled. The cycle count
of dref depends on the number of memory operations (l) performed.


absence of a pipeline stall when they are immediately followed by a
memory operation.

Push instructions are included to support compound term
creation. Using branch-and-bound search techniques, we determined an
optimal set of single-cycle instructions for creation of all possible
two and three-word structures. This set of instructions is optimal in
the sense that, for our microarchitecture, each structure is created in
the smallest number of cycles. The resulting "compound term
creation instruction set" favors the idiom of placing two words of data in
registers and then moving them to memory using a double-word
push. Push operations also allow the fill of the cache line from
memory to be skipped if a push incurs a cache miss and also refers to
the first word of the cache line [19]. This optimization has been used
in a previous Prolog design [5]. The push instructions allow the
amount of the increment to be specified and any general purpose
register can be used as a stack pointer.

Prolog requires that variable assignment be undone on
backtracking. This unbinding of variables is implemented by recording
variable addresses on a “trail” stack. The original WAM model
requires several pointer comparisons to determine if trailing is
necessary. Our implementation restricts variables to the global stack
(which reduces the number of comparisons to one) and uses a
compare instruction followed by a conditional push (pusht). The pop
instruction is used during backtracking to retrieve variable addresses
from the trail stack. The compiler can reduce the amount of trailing
and detrailing through the use of flow analysis to determine when
uninitialized variables [22] can be used (our use of uninitialized
variables is different from [22]: we use the same tag for both initialized
and uninitialized variables and determine at compile time when
destructive assignment is safe).
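
The compare-then-conditional-push idiom and the matching pop during backtracking can be sketched as below. This is a minimal model, not the BAM code sequence: the text does not state which pointer the single comparison uses, so the sketch assumes the usual WAM-style rule (trail only variables older than the most recent choice point, with the global stack growing upward). The names `bind`, `untrail`, and `choice_point_top` are hypothetical.

```python
def bind(memory, trail, var_addr, value, choice_point_top):
    """Bind a variable, trailing it only when needed.

    Assumption: a variable must be trailed only if it was created
    before the most recent choice point (var_addr < choice_point_top).
    """
    if var_addr < choice_point_top:   # the single comparison
        trail.append(var_addr)        # conditional push (cf. pusht)
    memory[var_addr] = value


def untrail(memory, trail, saved_trail_len, make_unbound):
    """Undo bindings on backtracking by popping trailed addresses."""
    while len(trail) > saved_trail_len:
        addr = trail.pop()            # cf. the pop instruction
        memory[addr] = make_unbound(addr)
```

In this model an unbound variable is whatever `make_unbound(addr)` returns; with self-referential unbound variables it would be a variable-tagged pointer to `addr` itself.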

Unsigned maximum (umax) is provided to simplify the
management of the environment and choice point stack pointers.
Because these stacks are intermixed, allocation occurs at the
maximum of the two stack pointer values.
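
A short sketch of why an unsigned maximum is the natural primitive here: with intermixed environment and choice point stacks, the next free location is the larger of the two stack pointers, and addresses must be compared as unsigned 32-bit values. The helper `allocate_frame` and its arguments are hypothetical names introduced for this example.

```python
MASK32 = 0xFFFFFFFF


def umax(a, b):
    """Unsigned 32-bit maximum of two words."""
    a &= MASK32
    b &= MASK32
    return a if a >= b else b


def allocate_frame(env_top, choice_top, size):
    """Allocate `size` words at the top of the intermixed stacks.

    Returns (frame_base, new_top). The base is the maximum of the
    environment and choice point stack pointers.
    """
    base = umax(env_top, choice_top)
    return base, (base + size) & MASK32
```

The unsigned comparison matters for addresses with the top bit set: as signed values `0x80000000` would compare below `0x7FFFFFFF`, whereas `umax` returns `0x80000000`.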

3.2. Prolog Specific Instruction Set Support

Prolog specific instructions are those instructions which are
tailored specifically for efficient execution of Prolog (Table 3).
These instructions support tagged pointer creation, two and
three-way branch on tag, pointer dereferencing, and unification of atoms.

3.2.1. Tagged Data Support

Pointer creation is accomplished by the load effective address
(lea) instruction which calculates an address and then replaces the
most significant four bits with an immediate tag. This instruction is
used to create pointers to unbound variables and compound terms
(lists and structures).
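
In register-transfer terms, the address calculation and tag replacement amount to one add and one bit-field insert. The following Python sketch models that behavior; it assumes a base-plus-displacement address calculation, and the function name and argument order are illustrative rather than the instruction's actual operand format.

```python
MASK32 = 0xFFFFFFFF
TVAR = 0b0001  # unbound variable tag, as given in the notes to Tables 1-3


def lea(base, disp, tag):
    """Compute base + disp and replace the top four bits with tag."""
    address = (base + disp) & MASK32
    return (address & 0x0FFFFFFF) | ((tag & 0xF) << 28)
```

For example, `lea(0x1000, 8, TVAR)` produces a variable-tagged pointer to address `0x1008`; any tag already present in the top four bits of `base` is discarded.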

Type checking built-ins are supported with single-cycle
compare-and-branch-on-tag instructions (btgeq and btgne). These
instructions also allow the compiler to replace shallow backtracking
with a conditional branch on an argument's tag.

Prolog allows unbound variables to be bound together. The
resulting reference chain must be dereferenced before subsequent
variable binding. WAM instructions always dereference their
operands, often resulting in superfluous dereferencing. However, our
optimizing compiler keeps track of which variables are dereferenced
and generates explicit dereferences only when necessary.
Implementing dereference as a single instruction reduces static code size
and allows dereference memory reads to be pipelined, resulting in a
tighter loop than the equivalent assembly code [9, 10]. We use the
same tag value for both unbound variables and reference pointers
(unbound variables are self referential). The dereference instruction
(dref) is implemented as a sequence of internal opcodes.
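
The behavior of dereferencing with self-referential unbound variables can be sketched as a loop over a word-addressed memory. This is a functional model only (it says nothing about the internal opcode sequence); the dictionary-based memory and the helper names are assumptions of the sketch, while the variable tag value 0001 is taken from the notes to Tables 1-3.

```python
TVAR = 0b0001  # tag shared by unbound variables and reference pointers


def tag_of(word):
    return (word >> 28) & 0xF


def addr_of(word):
    return word & 0x0FFFFFFF


def dref(word, memory):
    """Follow a reference chain; return (result, memory_reads).

    The loop stops at a non-variable word or at an unbound variable,
    which is recognized because it points to itself.
    """
    reads = 0
    while tag_of(word) == TVAR:
        target = memory[addr_of(word)]
        reads += 1
        if target == word:      # self-referential: unbound variable
            break
        word = target
    return word, reads
```

The number of memory reads returned here corresponds to the quantity on which the cycle count of dref is said to depend.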

All of the basic arithmetic and compare instructions (add, sub,
and, or, xor, cmp) have a version which traps on 28-bit overflow.
These instructions operate on the full 32-bit word, but 28-bit
overflow occurs if either of the sources or the result does not have
integer tags (0000 or 1111). The trap on 28-bit overflow allows
Prolog arithmetic operations to be compiled to fast, safe code which
avoids extra instructions for tag overflow checking. If a 28-bit
overflow does occur, the trap routine can signal an overflow error or
convert the data into an alternative representation.
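
The overflow rule stated above (trap unless both sources and the result carry an integer tag) can be written down directly. The sketch below models an add with this check; the exception class and function names are hypothetical, and only the tag values 0000 and 1111 come from the text.

```python
MASK32 = 0xFFFFFFFF
INTEGER_TAGS = (0b0000, 0b1111)


class TagOverflowTrap(Exception):
    """Raised where the hardware would take the 28-bit overflow trap."""


def has_integer_tag(word):
    return ((word >> 28) & 0xF) in INTEGER_TAGS


def add28(a, b):
    """Full 32-bit add that traps on 28-bit (tag) overflow."""
    a &= MASK32
    b &= MASK32
    result = (a + b) & MASK32
    if not (has_integer_tag(a) and has_integer_tag(b) and has_integer_tag(result)):
        raise TagOverflowTrap(hex(result))
    return result
```

Because the same check covers the source operands, adding a non-integer (for example a variable-tagged word) also traps, so no separate type test is needed in compiled arithmetic.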


3.2.2. Unification Support

Unification is one of the primary operations of Prolog; it is
used for argument passing, structure creation, structure
decomposition, and pattern matching. Although general unification is a
complex algorithm, if one is given information about the arguments
being unified, the general algorithm can be greatly simplified. This
is one of the advantages of the WAM instruction set over an
interpreter. Our compiler takes this principle further and propagates
information to simplify unification as much as possible.

Analysis of the primitives necessary to support unification of a
Prolog variable with an atom [21] motivates the single-cycle
unify-immediate instruction (uni) which binds the atom to the variable if
the variable is unbound, and otherwise tests them for equality.
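
Functionally, unifying a dereferenced word with an atom reduces to a two-way case split. The following Python sketch models that split; it assumes the argument has already been dereferenced, omits trailing, and uses hypothetical helper names. The result is a Boolean standing in for the condition the hardware would leave for a following branch.

```python
TVAR = 0b0001  # unbound variable tag (from the notes to Tables 1-3)


def tag_of(word):
    return (word >> 28) & 0xF


def addr_of(word):
    return word & 0x0FFFFFFF


def uni(memory, word, atom):
    """Unify a dereferenced word with a tagged atom.

    If word is an unbound variable, bind it to atom and succeed;
    otherwise succeed only if word equals atom.
    """
    if tag_of(word) == TVAR:
        memory[addr_of(word)] = atom
        return True
    return word == atom
```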

Unification of a Prolog variable with a compound term also
benefits from special support. Analysis of the primitives necessary
to support unification of a Prolog variable with a list or structure [21]


                   Argument Type (%)               Cost (cycles)
Benchmark          variable  structure  other      swt    two-way

prover                                  0.8        1.20   1.40
meta_qsort                              16.0       1.58   2.32
simple_analyzer                         8.3        1.33   1.74
chat_parser                             6.4        1.15   1.37

Table 4

WAM Variable/Compound Term Unification Statistics

This table gives the percent occurrence of the argument type for
variable/compound term unification in the WAM (get_list and get_structure
instructions). Columns 2-4 give the percent occurrence of variable,
list/structure, and other types. The swt column gives the average time to
execute the three-way branch assuming that the execution times for the
three directions (variable, list/structure, other) are (2, 1, 2) cycles
respectively. Likewise, the two-way column assumes that the three-way branch is
simulated using two two-way branches and that the execution times for the
three directions are (3, 1, 4). The statistics for Tables 4 and 5 were gathered
using the VLSI-PLM [4] microarchitecture simulator.

motivates the switch-tag instruction (swt), a three-way branch based
on the tag of one register. One direction of the branch is taken if the
tag is an unbound variable; a second direction is taken if the tag
matches a specified immediate tag (usually list or structure); and a
third direction is taken for all other tags. The three-way branch
could be implemented using two two-way branches; however, WAM
execution statistics (Table 4) show that there is a small but
significant performance advantage to the three-way branch.
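
The averages in the swt and two-way columns of Table 4 are weighted sums of the per-direction costs given in the table notes, (2, 1, 2) cycles for swt and (3, 1, 4) cycles for two two-way branches. The sketch below shows the calculation; the function name and the argument mix used in the usage note are assumptions made for illustration, not values quoted from the table.

```python
SWT_CYCLES = (2, 1, 2)       # (variable, list/structure, other)
TWO_WAY_CYCLES = (3, 1, 4)   # same directions, two two-way branches


def average_cost(fractions, cycles):
    """Expected branch cost for a mix of argument types.

    fractions: (variable, list/structure, other), summing to 1.0
    cycles   : cost of each direction, in the same order
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(f * c for f, c in zip(fractions, cycles))
```

With an assumed mix of 42% variable, 42% list/structure, and 16% other, the expected costs are 1.58 cycles for swt and 2.32 cycles for the two-way expansion, which illustrates how the gap widens as the "other" and variable cases become more frequent.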

The LOW RISC processor [8] provides a 5-way branch and the
Carmel-2 processor [10] provides a 10-way branch based on the tag
of a single register. WAM execution statistics show that such
generality is unnecessary for unification of a Prolog variable with a
compound term.

When the compiler cannot determine any information about the
types of the arguments to be unified, then general unification must be
used. In this case one can still take advantage of dynamic
properties of the argument types. The common cases of general unification
should be done quickly in-line and infrequent cases passed to a
general unification subroutine. Analysis of WAM execution (Table 5)
indicates that about 70% of all general unifications are simple
bindings of an unbound variable with a nonvariable. These statistics
motivate the switch-bind instruction (swb), a three-way branch
based on the tags of two registers. The conditions of the three
branch directions are: variable/nonvariable, nonvariable/variable,
and otherwise (order of the arguments matters). This allows the
common cases of variable/nonvariable and nonvariable/variable to
be done in-line. A general unification subroutine is called for all
other cases. Note that although the quick success and quick failure
cases are simple to check for, their execution frequency is low
enough that we have chosen not to do these checks in-line.
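
The three directions of the switch-bind branch depend only on whether each of the two (dereferenced) arguments carries the variable tag. A minimal Python model of that classification follows; the function name and the string labels are hypothetical, and the variable tag value is the one given in the notes to Tables 1-3.

```python
TVAR = 0b0001  # unbound variable tag


def swb_direction(tag_a, tag_b):
    """Classify a pair of argument tags into the three swb directions."""
    a_is_var = tag_a == TVAR
    b_is_var = tag_b == TVAR
    if a_is_var and not b_is_var:
        return "var/nonvar"    # bind first argument in-line
    if b_is_var and not a_is_var:
        return "nonvar/var"    # bind second argument in-line
    return "otherwise"         # call the general unification subroutine
```

Both the variable/variable case and the nonvariable/nonvariable case fall into the third direction, which matches the decision not to test for quick success or quick failure in-line.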

The Pegasus processor [9] supports general unification with a
16-way branch based on two tag bits from each of two registers. The
LIBRA processor [11] has a “partial unify” instruction. This
single-cycle instruction performs either a nop, a store, a call or a
branch depending on the tags and comparison of the two arguments.
It executes the variable/nonvariable case of general unification in


Benchmark          quick     quick     var/      nonvar/   var/   recursive
                   success   failure   nonvar    var       var

prover
meta_qsort
simple_analyzer
chat_parser

Table 5

WAM General Unification Statistics

This table gives the percent occurrence of various argument types passed to
general unification in the WAM (get_value and unify_value instructions).
In the quick success column both arguments are identically equal. In the
quick failure column both arguments are nonvariable and have unequal tags
or both are atomic and are unequal. In the var/nonvar column the first
argument is a variable and the second is a nonvariable. Likewise, in the
nonvar/var column the first argument is nonvariable and the second is
variable. In the var/var column both arguments are variable. The last column
contains the remaining cases which must be passed to a recursive
unification subroutine.

four cycles (not counting dereferencing of the arguments). Using
switch-bind (swb), BAM executes this case in five cycles. Although
the partial unify instruction of the LIBRA has a slight performance
advantage, its complexity does not fit with our goal of minimally
extending a general purpose architecture.

4.  Compilation  of  Prolog 

A significant aspect of our project was the simultaneous
development of an optimizing Prolog compiler [21, 23]. The
compiler incorporates techniques for determinism extraction and use of
destructive assignment. The compiler accepts standard Prolog and
produces code for a simple non-WAM abstract machine. Although
the compiler uses stacks and data structures similar to WAM
implementations, it does not use the WAM during compilation, but instead
directly compiles to its own abstract machine. Automatic mode
generation (type inferencing) is implemented using abstract
interpretation [24]. It derives ground, uninitialized variable [22], and
dereference modes. Optimizations are still being implemented, and we
expect our performance numbers to improve compared to the
numbers listed in the following sections.

Compilation of Prolog is done in three stages. First, the
compiler produces code for its abstract machine. Second, this code is
macro-expanded into the BAM instruction set. Finally, the BAM
code is optimized by a peephole optimizer and instruction reordering
stage that maximizes the use of the double-word bus and minimizes
the number of nops and pipeline stalls.

5. Cost/Benefit Analysis of Architectural Features and Instructions

In section 3 we motivated our instruction selection based on
several sources of information: work on abstract instruction sets for
compilers, bottom-up analysis of microarchitectural requirements for
high performance Prolog, and analysis of WAM execution statistics.
In this section we give a more rigorous validation of the architectural
design and instruction selection by analyzing the cost and
performance benefits of each special purpose feature and instruction.
There has been some work to determine such results for other
designs [9, 10, 15], but no complete analysis has been done.

5.1. Cost of Features

Table 6 shows the implementation cost of those features which


Feature                   Active area   Design complexity          Instructions affected

segment mapping           4.8%
tagged-immediate          2.2%                                     ldi, cmpi, sti, stdi, lea, uni
double-word memory port   1.9%          95% compiled; 5% by hand   ldd, std, stdc, pushd, pushdc
fast tag logic            1.6%                                     btgeq, btgne, swt, swb, dref, uni
multi-cycle/conditional   0.1%          100% compiled              stx, std, pushd, pusht, dref, uni
tag overflow detect       *0.0%                                    cmp28, add28, sub28, and28, or28, xor28
Total                     10.6%         99% compiled; 1% by hand

Table 6

Cost of Special Architectural Features

For each special feature of the BAM processor, this table gives the percentage of active area (transistors and wires) required
to implement the feature, the design complexity of the layout, and a list of instructions which depend on the feature. The
design complexity is given as a percentage of the layout that was automatically generated (using tilers, routers, etc.) and the
percentage that was laid out by hand. *100% compiled indicates that less than 30 gates were placed by hand.
Multi-cycle/conditional is a subset of internal opcodes; the 0.1% active area refers to the entire internal opcode implementation.


extend the BAM beyond a general purpose architecture.
Implementation cost is expressed in terms of chip area required to implement
the feature and in terms of VLSI design effort required. The chip
area is measured in percent of total active area which includes both
transistor and wiring area. The chip contains approximately 110,000
transistors, and the total active area is 91 square millimeters using
1.2 µm CMOS. The VLSI layout was done using a symbolic layout
editor with custom designed, parameterized cells. The building
blocks were assembled into larger units using a datapath compiler,
PLA compiler, tiler, and router. The design effort for each feature is
given as a percentage of its design that was automatically performed
by the design tools. The last column of Table 6 lists those
instructions which depend on a given feature. We do not give each
feature's effect on the cycle time, since the microarchitecture and
logic designs were done carefully to prevent these features from
being on the critical path.

Segment mapping requires the greatest area of the special
features. This area is primarily due to the 32 by 24-bit register file
which contains the segment map. This register file is used to extend
the address space as well as perform tag mapping. A smaller register
file tailored to tag mapping alone would take less area. The next
greatest area consuming feature is the tagged-immediate generation
circuitry. This is due in part to the use of three distinct instruction
formats for tagged-immediates. The double-word memory port
requires extra ports on the general purpose register file to support the
increased bandwidth. The area listed is the difference in size
between our four/five-port register file and the more usual three-port
register file. The extra pads required by the double-word bus are not
included in the cost. After the fast tag logic, the remaining features
use a very small portion of the total active area.

5.2. Benefits of Features

To determine the performance benefit of each feature, we
calculated the cycle count increase caused by omitting the use of all
instructions that depend on the feature [23]. For example, if omitting
the instructions ldd, std, stdc, pushd, and pushdc increases execution
time from 100 cycles to 111 cycles, then the performance benefit due
to the double-word memory port is 11%. An instruction is omitted
by replacing it with its macro-expansion into simpler instructions.
An effort was made to determine optimal expansions, and after
macro-expansion, peephole optimization and instruction reordering
are performed. Omission of segment mapping requires that explicit
instructions be inserted to mask tag bits before tagged-pointers are
used as addresses. A detailed description of the analysis techniques
is given in [26].
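
The benefit metric defined above is the relative increase in cycle count when a feature's instructions are replaced by their macro-expansions. A one-function sketch, using a hypothetical function name, makes the arithmetic of the worked example explicit.

```python
def performance_benefit(cycles_with, cycles_without):
    """Percent performance benefit of a feature.

    cycles_with   : cycle count using the feature's instructions
    cycles_without: cycle count with those instructions macro-expanded
    """
    if cycles_with <= 0:
        raise ValueError("cycle count must be positive")
    return 100.0 * (cycles_without - cycles_with) / cycles_with
```

For the example in the text, `performance_benefit(100, 111)` evaluates to 11.0, that is, an 11% benefit for the double-word memory port.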

Table 7 lists the performance benefit of the features given in
Table 6. Fast tag logic, double-word memory port, segment
mapping, multi-cycle support, and tagged-immediate support are
consistently important features. Tag overflow detection is important
only in programs which make heavy use of integer arithmetic. The
overall Prolog support column is determined by using only the
instructions from Table 1 (and non-tagged versions of ldi and cmpi),
omitting segment mapping and all instructions in Tables 2 and 3.

To summarize, the specialized support added for Prolog does
not require unreasonable amounts of chip space or hand layout (11%
active area for all Prolog related features), and it provides a
performance benefit of 70%.

5.3. Benefits of Individual Instructions

Table 8 provides a similar analysis applied to individual
instructions or instruction groups, rather than to architectural
features. Significant (greater than one percent) performance benefit
is obtained from a majority of the special purpose instructions (dref,
umin/umax, lea, push/d/c, swt, and btgeq/ne). The multi-cycle
pointer dereference instruction (dref) has an average execution time
of 1.6 cycles. Macro-expansion of dref into an explicit loop
increases the average dereference time to 2.2 cycles. Although the
benefit of dref per dereference is only 0.6 cycle, the total
performance benefit is significant because of its frequent use. Some of the
smaller benchmarks, however, show no benefit for dref due to the
complete elimination of dereferencing by compiler optimization.
Unsigned maximum (umax) is used during environment and choice
point creation. Omission of umax causes the time to determine the
top of stack to increase from one to three cycles. Tagged-pointer
creation (lea) is a frequent operation, and its omission adds an extra
cycle for tag insertion (using or). Elimination of auto-increment
addressing (push, pushd, pushdc) requires one extra cycle for each
block allocation. The three-way branch on tag (swt) can be replaced
by two btgeq instructions, adding an extra cycle to two of the branch
directions. Elimination of the two-way branch on tag (btgeq/ne)
would require a two instruction compare and branch.

The remaining instructions have less than one percent average
performance benefit. Because the VLSI-PLM spends about 5% of its
time trailing variable addresses, we included special support in the
BAM (pusht). However, due to the compiler's use of uninitialized
variables, which do not have to be trailed, trailing time is reduced to
1.4% in the BAM. Omitting pusht causes a slowdown of 0.7%,
which corresponds to trail time increasing from 2 to 3 cycles.
Preliminary analysis using macro-expanded WAM for the chat_parser
benchmark indicated that the benefit for pop would be U%.
Compiler optimization of trailing has reduced this result. Similarly,
compiler optimization reduces the number of general unifications,


Table 7

Performance Benefit of Special Architectural Features

Table 8

Performance Benefit of Individual Instructions

Tables 7 and 8 give the percent performance benefit for each special feature and instruction of the BAM processor. The last
column of Table 7 lists the performance benefit of segment mapping and all instructions given in Tables 2 and 3. Averages are
calculated using only the last four benchmarks which are representative of well written, medium sized (100-1000 line) Prolog
programs. All benchmarks are compiled with automatic mode generation, and cache effects are not included.


minimizing the benefit of swb. Our initial studies also
overestimated the benefits of special support for unification of atoms (uni,
sti, stdi). Although pusht, swb, pop, uni, sti, and stdi provide
marginal performance benefit, their implementation uses only features
already required by other instructions.

An interesting conclusion about the number of directions
needed in multi-way branches can be made from these measurements.
Multi-way branches are implemented in the BAM with the
swt and swb instructions, which are both single-cycle three-way
branches (Table 3). Swt is used for unification of compound terms,
for which greater than a three-way branch is not needed (Table 4 and
[21]). Swb is used for unification of terms whose types are unknown
at compile time. It takes care of 70% of these cases (Table 3), which
gives a 0.6% execution time improvement (Table 8). If some
single-cycle branch took care of 100% of these cases, we calculate
the further improvement would be about 0.7%. Given the additional
complexity that such a branch implies, we conclude that a multi-way
branch with more than three directions is not effective for Prolog.

6. Performance Results

Table 9 compares the performance of the BAM processor to
that of other Prolog systems. The results for BAM are simulated
assuming a 30 MHz clock and include overhead due to cache misses
[19]. The simulated system has 128 KB instruction and data caches.
The caches are direct mapped and use a write-back policy. They are
run in warm start, that is, each benchmark is run twice and the results
of the first run are ignored. Cache effects are significant only for the
last five programs in Table 9. The cache overhead is greatest for
simple_analyzer, poly_10, and tak; for these programs the overhead
ranges from 11% to 38%. For meta_qsort and chat_parser the
overhead is less than 3%.


Table 9

Performance Results


This table compares the performance of BAM with that of several other Prolog implementations for which benchmark results are
available: Quintus Prolog, the VLSI-PLM, and the KCM. Each result is presented as a time in milliseconds followed in
parentheses by the ratio to the best BAM time. The Quintus Prolog results are for compiled code executing under Quintus Prolog
Release 2.0 on a Sun 3/60. The VLSI-PLM [4] results are simulated assuming a cycle time of 100 ns with no cache misses. The
KCM results [6] are derived from actual measurements of a system with a cycle time of 80 ns. The BAM results are simulated
assuming a 30 MHz clock and 128 KB instruction and data caches [19]. For BAM, the auto modes and no modes columns give
results with and without automatic mode generation. Results are presented for the well-known Warren benchmarks (the first
eight in the table), of which query is modified to use integer division in place of the original floating point; for mu, which proves a
theorem of Hofstadter's "mu-math"; for prover, a simple theorem prover; for queens_8, which solves the eight queens problem
using an incremental generate-and-test strategy; for meta_qsort, a meta-interpreter running Warren's qsort; for simple_analyzer, a
flow analyzer analyzing Warren's qsort; for poly_10, which symbolically raises a polynomial to the tenth power; for tak, which
executes recursive integer arithmetic; and for chat_parser, which parses a set of English sentences. Further information about the
benchmarks may be found in [28]. The benchmarks are available by anonymous ftp from arpa.berkeley.edu.


Although programs are usually compiled with automatic mode
generation, we have included numbers without modes to show the
effect on performance. The average performance improvement due
to automatic mode generation is 1.44. The number is higher for
some of the smaller benchmarks because mode generation is able to
do a better job for them. For example, the qsort and queens_8
benchmarks perform well because the mode information allows the
compiler to eliminate most choice point creation and replace variable
binding with destructive assignment. The number is lower for the
simple_analyzer benchmark because it uses built-in predicates
heavily.


The KCM [6], one of the best WAM implementations, has a
relatively large amount of specialized hardware to execute a WAM-like
instruction set efficiently, whereas the BAM processor uses
modest hardware to support an optimizing compiler. We find that
the speed advantage of the BAM over the KCM is equal to or greater
than the cycle time ratio.

A common measure of Prolog speed is logical inferences per
second (LIPS). In general this quantity is ambiguous; however, it is
well defined for the naive reverse benchmark. The execution time
for naive reverse with automatic modes (Table 9) gives a
performance of 3.68 million LIPS.
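
The relation between a LIPS figure and an execution time can be sketched as follows. This is only an illustration: it assumes the conventional form of the naive reverse benchmark (a 30-element list, with every call to nrev/2 or append/3 counted as one logical inference), which this report does not state explicitly, and the predicate names are hypothetical.

```prolog
% nrev_inferences(+N, -LI)
% Logical inferences needed to naive-reverse an N-element list:
% N+1 calls to nrev/2 plus N(N+1)/2 calls to append/3.
nrev_inferences(N, LI) :-
    LI is (N + 1) * (N + 2) // 2.

% lips(+Inferences, +Millis, -Lips)
% Inferences per second for a run that took Millis milliseconds.
lips(Inferences, Millis, Lips) :-
    Millis > 0,
    Lips is Inferences * 1000 / Millis.

% run_time_ms(+Inferences, +Lips, -Millis)
% Inverse of lips/3: the run time implied by a given LIPS rate.
run_time_ms(Inferences, Lips, Millis) :-
    Lips > 0,
    Millis is Inferences * 1000 / Lips.
```

Under the stated assumption, nrev_inferences(30, LI) gives LI = 496, and run_time_ms(496, 3.68e6, T) gives a run time of roughly 0.135 ms.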


Table 10 compares the static code sizes of the BAM, the KCM
[6], and the SPUR [27] relative to the PLM [3]. Macro expansion of
WAM code into SPUR instructions causes the large code size of the
SPUR. Static code size for the BAM is surprisingly small, only
slightly larger than that of the KCM. This is due to direct compilation
into simple instructions, the success of flow analysis in reducing
code size, and the appropriateness of the BAM instruction set for
Prolog.


                BAM/PLM    KCM/PLM    SPUR/PLM

bytes             3.1        3.0        14.1

instructions      2.6        1.1        12.0

Table 10

Static code size ratios

This table gives the static code sizes of the BAM, the KCM, and the
SPUR relative to the PLM, a micro-coded implementation of the WAM
[3]. The BAM code size is calculated from prover, meta_qsort,
simple_analyzer, and chat_parser. The KCM code size is from [6].
The SPUR code size is from [27].


7. Conclusions

The primary goal of our research has been to determine a
minimal set of extensions to a general purpose architecture necessary
for achieving high performance logic programming. At the same
time, however, performance of the general purpose architecture has
not been compromised. We have identified tagged-immediate support,
segment mapping, double-word memory bus, special logic for
fast branch on tag, and multi-cycle instruction support as important
Prolog specific features. Our measurements justify the utility of
push, pointer dereference, and tagged-pointer creation instructions.
Our special instructions for trailing and unification of atoms,
however, are of marginal benefit. Finally, we conclude that a multi-way
branch with more than three directions is not effective for Prolog.

We have demonstrated that one can extend a general purpose
architecture to include explicit support for symbolic languages such
as Prolog with modest increase in chip area (11%) and yet attain a
significant performance benefit (70%).

Abstract

The rapid prototyping of microprocessors requires a high level of automation. An
environment suitable for developing application programs which accelerate the design
process should provide an efficient method for manipulating data and a powerful programming
environment. This paper describes the benefits we have discovered by using PROLOG as the
foundation for ASP, a suite of CAD tools tailored towards the automatic generation of
microprocessors. PROLOG provides an inherent relational database which is ideal for
describing and manipulating a host of elements at all phases of a design, from a behavioral
description to a circuit layout. PROLOG also lends itself to heuristic as well as
algorithmic programming styles.

1.  Introduction 

There are many characteristics inherent to data elements in Computer Aided Design
(CAD) that make them difficult to represent in a database [1-2]. The difficulty lies in
expressing the many different relationships between elements. For example, a wire element may be
related to other wire elements by node, by layer, and by location. A CAD tool should be able
to generate a set of elements by any of these relations. This paper will show that the relational
database inherent in Prolog is well suited for the requirements of a CAD database. An
implementation of objects which cover the entire design process is presented.

Although some CAD problems are well understood, most of the problems in CAD are
only partially understood or not well defined. Problems of this nature are solved by
employing heuristics such as simulated annealing and rule based expert systems. Problems that are
well understood, such as simulation and channel routing, are solved by proven algorithms.
Problems that are partially understood may have heuristics embedded within algorithms.
Prolog supports both algorithmic and heuristic programming techniques, which makes it an
ideal candidate for CAD programming. This paper will illustrate many of the Prolog
programming techniques employed in ASP.

ASP [3] is a full-range synthesis system tailored for the development of
microprocessors. It produces VLSI masks from instruction set architecture specifications written in
Prolog. The system is composed of several hierarchical components that span behavioral,
circuit, and geometric synthesis. Behavioral descriptions are transformed into register transfer
level descriptions by VIPER [4]. Controller and datapath are realized in sticks by a suite of
layout tools in VENOM. The blocks are compacted, placed, and routed by Sticks Pack [5].

This paper will reveal some of the problems associated with representing data for CAD
while illustrating the solutions that we have discovered using Prolog. An application of these
philosophies, Sticks in Prolog (SIP), is explained in detail and the other abstract levels in ASP
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are  introduced.  Advantages  for  using  a  clause  based  language  for  CAD  development  will  be 
presented  by  describing  the  programming  methodology  employed  by  ASP. 

2.  Design  Considerations  For  Implementing  CAD  Objects 

To model the many complex CAD structures as well as the relationships between
structures, many CAD environments use object oriented databases. CAD elements, whether they
be geometry for a compactor, transition states for a simulator, or logic expressions for a logic
minimizer, can all be expressed in terms of objects. There are two strategies for representing
CAD elements as objects.

In one approach the database provides a set of primitive objects (objects such as
polygons, properties, containers, and paths) that model CAD relationships with a
representation policy. For example, a container object can be used to describe a common node
relationship by placing all objects belonging to a node within the container object. Similarly, a
common layer relationship can be represented by placing all objects that share a common layer
within the container object [6]. The primitive objects must be capable of representing every
data element and relationship that will be necessary for any design. A policy to represent
CAD elements with the primitive objects must be chosen. There may be several possible
representations of an element within a given set of objects. For example, given a data object
of type BOX containing four integer value fields, a box can be represented as a center
coordinate with width and length measurements as in CIF, or as a pair of coordinates denoting two
opposite corners. Relationships between objects must be explicitly defined. Once
established, all CAD applications must adhere to the well defined set of policies.

In another approach, each CAD element is expressed as an object [7-8]. For example,
elements such as wires, nets, contacts, transistors, and waveforms are all expressed as tailored
objects. Relationships can be expressed implicitly within the objects by adding data fields.
For example, a wire element may contain a field describing the layer of the wire or a
pointer to another wire of the same layer. With this methodology, the representation policy is
deeply embedded within the data objects. Provisions must be made for adding new objects. For
example, assume that a system tailored for CMOS circuits must be modified to handle bipolar
transistors for a BIMOS circuit. If the data fields chosen for the transistor element are
incapable of representing the bipolar transistor, a new data type must be added to the system.
Furthermore, all programs that process transistors must be modified to support the new data
type. The primary issue in developing a set of data objects to represent CAD elements is
determining how much inherent support to offer [9].


2.1.  PROLOG  as  a  Database 

Relationships between the elements can be expressed in terms of groups. For example,
elements in a cell can be grouped by node, by location and by layer. Current object oriented
databases for CAD have strict set relations [6-8]. For example, many databases categorize
wires by layer, but not location. To find wires of the same layer, one simply calls a generator
that returns instances of wires that are of the queried layer. But to find wires of the same
grid, one cannot simply generate wires based upon the grid information, but must generate
wires by layer and filter out the wires that are not of a common grid. Data in Prolog is linked
by structure and by value. Thus, the procedure for generating all wires on the metal1 layer is
the same as the procedure for generating all wires on row 5, or generating all wires of node
vdd, or generating all the wires of row 5 and node vdd in metal1. Prolog also provides
structures such as binary trees and sorted lists. These constructs make accesses to the ASP
Prolog database very uniform.
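
The uniformity of access described above can be sketched with a handful of facts and one generic query predicate. The wire facts and the name wires_matching/2 are hypothetical, and the sketch assumes that the first coordinate of a point is its row; only the argument layout of wire/5 is taken from the SIP format.

```prolog
% Hypothetical wires: wire(Layer, pt(X1,Y1), pt(X2,Y2), Width, Net).
% Assumption: the first coordinate of pt/2 is the row.
wire(m1, pt(5, 0), pt(5, 4), 2, vdd).
wire(m1, pt(0, 0), pt(0, 4), 2, vss).
wire(m2, pt(5, 2), pt(0, 2), 1, out).
wire(p,  pt(5, 1), pt(5, 3), 1, vdd).

% wires_matching(+Pattern, -Wires)
% Collect every stored wire that unifies with Pattern. The same
% predicate serves queries by layer, by row, by node, or by any
% combination, because the selection is expressed in the pattern.
wires_matching(Pattern, Wires) :-
    findall(Pattern, call(Pattern), Wires).
```

For example, wires_matching(wire(m1, _, _, _, _), Ws) selects by layer, wires_matching(wire(_, pt(5, _), pt(5, _), _, _), Ws) selects by row, and wires_matching(wire(m1, pt(5, _), pt(5, _), _, vdd), Ws) combines layer, row and node in a single pattern.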

In ASP, each CAD element is expressed as an object. Elements ranging from
behavioral descriptions of architectures to logic equations for a module generator to offset
contacts in an ALU layout are all directly expressed in and referenced through Prolog. In
Prolog there is no syntactic or semantic difference between a procedure call and a database
query. This makes the introduction of new data types very simple. Clauses that process new
data types can be easily integrated into the system. There are thirty different representations
of a design, each with a set of data objects. One of the lowest levels, Sticks in Prolog, will be
described in detail in the next section.

2.2.  Sticks  in  PROLOG 

Sticks  in  Prolog  (SIP)  is  a  grid  based  sticks  representation  in  Prolog  that  supports 
hierarchy  and  parameterized  elements.  Module  generators  or  human  designers  generate  SIP 
files  which  are  converted  to  mask  geometry  by  the  STICKS-PACK  compactor.  In  SIP,  VLSI 
elements  are  modeled  as  facts.  Attributes  for  the  elements  are  represented  as  atoms  within 
the facts. Currently, the SIP language consists of four facts representing VLSI elements:


wire(Layer, pt(X1, Y1), pt(X2, Y2), Width, Net).
cont(Type, pt(X1, Y1), Offset, Net).
transistor(Type, pt(SX1, SY1), pt(GX2, GY2), pt(DX3, DY3), W, L, Nets, Netg, Netd).
pin(pt(X1, Y1), Layer, Width, Label, Cell).


Layers are of the atoms: m1, m2, p, pd, nd

These represent the physical layers of the element (metal1, metal2, poly, P-diffusion, or
N-diffusion).

Contact offsets are of the atoms: nw, nn, ne, ww, nof, ee, sw, ss, se
Contact types are of the atoms: m1m2, m1pd, m1nd, m1p

Width, XY coordinates, W, and L are integers. Nets are atoms that represent the connectivity
node of the element. Elements of the same node are electrically connected. Nodal
information is extracted by a net extracting program. pt(X, Y) represents a point location at (X, Y).
Transistors have 3 point locations, one for the source, one for the gate, and one for the drain.
Each location has a separate node.

Example:  An  Inverter  in  SIP: 


wire(ml ,  pt(0fi),  pt(0J)2,vdd). 
wire(m],pt(OJ),  p^2J)2,vdd). 
wire(ml,  pt(10,0),  pt(J0J)2.vss). 
wire(m],  pt(10,l),  pt(S,I)2,vss). 
wire(mJ,pl(8J),  pt(2J)2,out). 
wire(mJ,pt(6J),  pt(6S)2,oiit). 


wire(p.  pt(8J),  pt(22)J2,in)- 

wire(p.  pt(6j0),  pt(6J)J,in). 

trans(nd, pt(2.J).  pt(22),  pt(22).  4,2,  vdd,  in,  out). 

trans(pd,  pt(8,I),  pt(82),  pt(8J),  2, 2,  vss,  in,  out). 

contfmlpd,  (2,1),  nof,  vdd). 

cont(mlpd,  (22),  nof,  out). 

cont(mIpd,  (8,1),  nof,  vss). 

cont(mlpd,  (82),  nof,  out). 

pin(pt(6, 0), p, 1, input, inv).

pin(pt(6, 5), p, 1, output, inv).


Different CAD applications often generate different sets of elements. For example, the
simulator may generate all of the elements that are of nodes adjacent to a given node. The
compactor may generate all of the elements that are of the same grid and layer as a given
element. The floorplanner may generate all of the terminals of a given cell side. With the SIP
representation, data elements can be generated by any combination of characteristics very
easily. For example, all of the wires that are of m1 and of node vdd which have a width greater
than 3 can be generated in two lines of Prolog:


wire(m1, Pt1, Pt2, Width, vdd),
Width > 3.


This representation also allows fields to be easily parameterized within a cell. For example,
in a cell definition we have parameterized an output transistor with the statement:


parameter(outputrans,  pt(2, 3)). 


A call to the following clause would permit the modification of the W/L ratio of any
transistor that has been parameterized.


modtsize(Name, Neww, Newl) :-
    parameter(Name, pt(Xloc, Yloc)),
    retract(trans(Layer, pt(Sx, Sy), pt(Xloc, Yloc), pt(Dx, Dy), _, _, Ns, Ng, Nd)),
    assert(trans(Layer, pt(Sx, Sy), pt(Xloc, Yloc), pt(Dx, Dy), Neww, Newl, Ns, Ng, Nd)), !.


modtsize(_Name, _Neww, _Newl) :-
    write('transistor not found'), !.
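
A small self-contained session illustrates how such a resizing clause behaves on a live database. The cell contents below are hypothetical, the predicate is named resize/3 to mark it as a sketch rather than the ASP code, and it assumes that the parameterized point is the gate location of the transistor (the second pt/2 argument of trans/9).

```prolog
:- dynamic trans/9, parameter/2.

% Hypothetical cell: one parameterized output transistor whose
% gate sits at pt(2, 3), with W = 4 and L = 2.
parameter(outputrans, pt(2, 3)).
trans(nd, pt(2, 2), pt(2, 3), pt(2, 4), 4, 2, vss, in, out).

% resize(+Name, +NewW, +NewL)
% Look up the gate location registered under Name, remove the
% matching transistor fact, and store it again with the new W and L.
% Fails if no transistor is registered under Name.
resize(Name, NewW, NewL) :-
    parameter(Name, Gate),
    retract(trans(Type, S, Gate, D, _, _, Ns, Ng, Nd)),
    !,
    assertz(trans(Type, S, Gate, D, NewW, NewL, Ns, Ng, Nd)).
```

After calling resize(outputrans, 8, 2), the database holds trans(nd, pt(2, 2), pt(2, 3), pt(2, 4), 8, 2, vss, in, out) in place of the original fact; the caller never needs to know where in the cell the transistor lives.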


This flexibility allows tools to address and modify specific elements within any context.
For example, a program that tries to optimize the performance of a circuit containing many
cells can do so by adjusting the W/L ratio of the output transistors. With the output
transistors parameterized, the program can reference the output transistors from any cell simply as
"outputrans" regardless of the transistor's environment.

SIP provides an excellent abstraction of VLSI layout for automated module
generators that produce sticks layout. For example, the following clause:


makeinvert(Vddgrid, Vssgrid, Ingrid, Outgrid, Pw, Pl, Nw, Nl) :-
    Pdgrid is Vddgrid - 1,
    Ndgrid is Vssgrid + 1,
    assert(wire(m1, pt(2, Vddgrid), pt(2, Pdgrid), 1, unk)),
    assert(wire(m1, pt(2, Vssgrid), pt(2, Ndgrid), 1, unk)),
    assert(wire(m1, pt(1, Vddgrid), pt(5, Vddgrid), 1, unk)),
    assert(wire(m1, pt(1, Vssgrid), pt(5, Vssgrid), 1, unk)),
    assert(wire(m1, pt(4, Pdgrid), pt(4, Ndgrid), 1, unk)),
    assert(wire(m1, pt(4, Outgrid), pt(5, Outgrid), 1, unk)),
    assert(wire(p, pt(3, Pdgrid), pt(3, Ndgrid), 1, unk)),
    assert(wire(p, pt(0, Ingrid), pt(3, Ingrid), 1, unk)),
    assert(cont(m1d, pt(2, Pdgrid), nof, unk)),
    assert(cont(m1d, pt(2, Ndgrid), nof, unk)),
    assert(cont(m1d, pt(4, Pdgrid), nof, unk)),
    assert(cont(m1d, pt(4, Pdgrid), nof, unk)),
    assert(trans(pd, pt(1, Pdgrid), pt(2, Pdgrid), pt(3, Pdgrid), Pw, Pl, unk, unk, unk)),
    assert(trans(nd, pt(1, Ndgrid), pt(2, Ndgrid), pt(3, Ndgrid), Nw, Nl, unk, unk, unk)), !.


will  generate  an  arbitrarily  sized  inverter  with  variable  input  and  output  locations.  Nodal 
information  is  deduced  by  the  extractor.  Roms,  PLAs,  and  other  regular  layout  structures  can 
be  generated  in  a  similar  fashion. 

3.  PROLOG  Programming  for  CAD 

There has been a growing trend in CAD to develop tools that use both algorithmic and
rule-based programming styles [10-11]. Algorithms are generally fast, but are inefficient at
handling problems that have many special cases. Rule-based systems are well suited for
solving problems with many special cases or problems that are not well defined. Rule-based
systems have generally been slow. Rules in such a system must be looked up and efficient
management systems have not yet been developed. Many CAD problems, such as
simulation, have algorithmic solutions, but most problems, such as routing and logic minimization,
can be solved by a host of methods.

Prolog provides an environment for both algorithmic and rule-based programming
styles. Several examples of both styles have been implemented in ASP. An example of how
simulated annealing is implemented in Prolog is illustrated in the Appendix. The clausal
nature of Prolog allows rules to be easily updated or modified. Algorithms can also be
expressed in a simple and intuitive manner which makes Prolog a language ideal for rapid
prototyping.

Prolog source code is typically 10-100 times more dense than C or Fortran source code
performing the same function. This makes Prolog systems much more readable and
maintainable. For a large system such as a silicon compiler, this has turned out to be
essential.

3.1.  PROLOG  Programming  Methodology  Employed  by  ASP 

Three basic formats for Prolog clauses are employed by ASP:

Procedural Clauses: These clauses work to achieve a certain value or state without
failing. Examples of such clauses include arithmetical functions and list manipulations.

/* The mindist routine finds the minimum spacing distance between two objects of layer and
width. The 'space' routine returns the minimum spacing distance between two layers, and
the width routine determines the minimum width of a layer */


mindist(Layer1, Width1, Layer2, Width2, Distbetwnobjcts) :-
    space(Layer1, Layer2, Distance),
    width(Layer1, Widthspace1),
    Widthmod1 is Width1 * Widthspace1,
    width(Layer2, Widthspace2),
    Widthmod2 is Width2 * Widthspace2,
    Distbetwnobjcts is Widthmod1 + Widthmod2 + Distance.


Filtering Clauses: These clauses interpret a given set of data elements differently
depending upon the values of certain data fields. If-Then and Case constructs can be expressed
through these clauses.

/* checkconstr determines how to space two elements. Each sub-clause filters out a certain
condition. If the elements are on the same row, the spacing is irrelevant. If the elements are
contacts, they can not be stacked upon each other and must be spaced accordingly. If the
elements are not contacts and of the same node, the spacing doesn't matter, otherwise the
elements must be spaced */


checkcoratriLayerl ,  Width!,  Nodel,  Rowl, Layer!,  Width!.  Node!,  Row!,  Layer!):- 
Rowl»Row!. 

checkconstr(LayerI ,  Widthl,  Node!,  Rowl,  Layer!,  Width!, Node!,  Row!,  Layer!):- 
contacts(Layerl ,  Layer!), 


checkconstr(Layerl ,  Widthl,  Nodel,  Rowl,  Layer!,  Width!.  Node!,  Row!,  Layer!):- 
Nodel=Node2. 

checkconstr(Layerl,  Widthl,  Nodel,  Rowl,  Layer!,  Width!,  Node!,  Row!,  Layer!):- 


Generator  Clauses:  These  clauses  generate  sets  of  elements  through  backtracking  or  the 
bagof  construct  in  Prolog. 


/* Makebox, a routine that creates boxes from various elements, first processes wires, followed by contacts and
transistors. */


makebox :-
    wire(Layer, pt(X1, Y1), pt(X2, Y2), Wid, Node),


    fail.

makebox :-
    cont(Type, pt(Row, Y), Oset, _),
    fail.

makebox :-
    trans(Type, pt(Sx, Sy), pt(Gx, Gy), pt(Dx, Dy), W, L, Sn, Gn, Dn),
    fail.

makebox.
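
The generator pattern can be made concrete with a minimal runnable sketch. It is not the ASP makebox routine: the wire facts, the box/3 representation (a bounding box of the wire's centre line, ignoring width) and all predicate names are assumptions chosen for illustration. It shows the failure-driven loop used above next to the list-collecting alternative.

```prolog
:- dynamic box/3.

% Hypothetical wires in the SIP format.
wire(m1, pt(0, 0), pt(0, 4), 2, vdd).
wire(p,  pt(3, 1), pt(1, 1), 1, in).

% wire_box(+P1, +P2, -LowerLeft, -UpperRight)
% Corners of the bounding box of a wire's centre line.
wire_box(pt(X1, Y1), pt(X2, Y2), pt(Xmin, Ymin), pt(Xmax, Ymax)) :-
    Xmin is min(X1, X2), Xmax is max(X1, X2),
    Ymin is min(Y1, Y2), Ymax is max(Y1, Y2).

% Failure-driven loop: each wire is visited in turn, a box is
% recorded as a side effect, and fail forces backtracking into the
% next wire. The second clause lets the call succeed at the end.
make_boxes :-
    wire(Layer, P1, P2, _Width, _Net),
    wire_box(P1, P2, LL, UR),
    assertz(box(Layer, LL, UR)),
    fail.
make_boxes.

% The same set gathered as a list, without side effects.
boxes(Boxes) :-
    findall(box(Layer, LL, UR),
            ( wire(Layer, P1, P2, _, _),
              wire_box(P1, P2, LL, UR) ),
            Boxes).
```

The failure-driven form suits tools that update the database in place, while findall/3 (or bagof/3, which fails instead of returning an empty list when there are no solutions) suits tools that want the generated set as a value.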


4.  Conclusion 

Prolog provides a relational database and a powerful programming environment. The
relational database is easy to use, can represent all CAD objects, and provides a flexible
interface to the programming environment. The clausal nature of Prolog provides an environment
suitable for algorithmic and rule based programming styles. The success of ASP has shown
that Prolog is a robust language well suited for CAD development.

This  work  was  sponsored  in  part  by  Defense  Advanced  Research  Projects  Agency 
(DoD)  and  monitored  by  Space  and  Naval  Warfare  Systems  Command  under  Contract  No. 
N00039-84-C-0089. 
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5.  Appendix 


% Simulated Annealing package

% You provide the move set, stopping criterion, and number of inner loop iterations.

siman(InitTemp, State0, Cost0, Finalstate, Finalcost) :-
    doOuter(InitTemp, State0, Cost0, Finalstate, Finalcost).

% Outer Loop

doOuter(Temp, State0, Cost, State0, Cost) :-
    endhere(Temp, State0, Cost).        % Outer loop complete by criterion endhere

doOuter(Temp, State, Cost, Finalstate, Finalcost) :-
    doInner(0, Temp, State, Cost, Newstate, Newcost),
    updatetemp(Temp, NewT),
    doOuter(NewT, Newstate, Newcost, Finalstate, Finalcost).

% Inner Loop

doInner(Count, Temp, State, Cost, State, Cost) :-
    maxinnercount(Mcount),
    Count > Mcount.                     % inner loop complete

doInner(Count, Temp, State, Cost, Finalstate, Finalcost) :-
    gennewstate(State, Newstate, Newcost),      % create a new state by move
    Deltacost is Cost - Newcost,
    accept(Deltacost, Temp),
    Nextcount is Count + 1,
    doInner(Nextcount, Temp, Newstate, Newcost, Finalstate, Finalcost).

doInner(Count, Temp, State, Cost, Finalstate, Finalcost) :-    % new state not accepted
    Nextcount is Count + 1,
    doInner(Nextcount, Temp, State, Cost, Finalstate, Finalcost).

accept(Deltacost, Temp) :-              % Good move
    Deltacost =< 0.

accept(Deltacost, Temp) :-              % Random factor
    Aexp is -1 * Deltacost / Temp,
    Afactor is exp(Aexp),
    random(Randnum),
    Randnum < Afactor.

updatetemp(Temp, Newtemp) :-
    Newtemp is Temp * 0.04, !.


maxinnercount(100).
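
The acceptance rule in the package can be examined in isolation. The sketch below is a hypothetical helper, not part of the package: it returns the probability with which accept/2 would succeed for a given Deltacost and temperature, following the two accept/2 clauses (certain acceptance when Deltacost =< 0, otherwise exp(-Deltacost/Temp)).

```prolog
% accept_probability(+Deltacost, +Temp, -P)
% Probability that a move is accepted under the rule used by
% accept/2: always when Deltacost =< 0, otherwise exp(-Deltacost/Temp).
accept_probability(Deltacost, _Temp, 1.0) :-
    Deltacost =< 0,
    !.
accept_probability(Deltacost, Temp, P) :-
    Temp > 0,
    P is exp(-Deltacost / Temp).
```

For a fixed positive Deltacost the probability falls as the temperature is lowered, which is why late moves in the schedule are accepted almost only when Deltacost =< 0.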
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Abstract 

One of the key steps in performance prediction of multiprocessor systems using simulations is the
validation process. A step in the validation process consists of sequential execution of benchmark
programs on the multiprocessor simulator and a uniprocessor simulator, and comparing the results
and performance measurement data. The simulated cycle count, simulator overhead, operation
count, and memory access count are identified to be the key performance data needed for the
comparison. This process is illustrated using the multiprocessor simulator NuSim for the parallel execution
of Prolog programs and the uniprocessor simulator VPsim. For large programs, the counts obtained
from the two simulators are within 10% of each other.
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1  Introduction 


Simulation is an accurate and effective approach in predicting performance of a new
multiprocessor system, taking into account the many intricate details in the hardware and software
designs. The degree of accuracy depends on how much detail is included in the simulator. To
ensure that the simulator accurately reflects the real system (yet to be built), the simulator
must be carefully validated for correct functional as well as timing results.

The validation process is carried out primarily by comparing performance data from the
new simulator with known data obtained from previously validated sources. The validation
process itself can be quite tedious and difficult, with massive amounts of information that
need to be analyzed. In this paper, we present our approach to validation. The process
involves sequential execution of benchmark programs on the multiprocessor simulator and a
uniprocessor simulator, comparing results and performance data.


2  Validation  Methodology 


There are many approaches to the validation of a simulation model [Sar88]. The concept of
our approach to validation is quite simple: comparing new, unverified results with previously
known answers. The more difficult task is the careful consideration of the many different
factors that can affect the results and the degree of these effects. The validation process
for a computer system simulator is best done in a stepwise fashion. The exact details of
the necessary steps depend on the availability of the known result, or the basis used for
comparison.
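
The comparison of new results against known answers can be sketched as a tolerance check over pairs of measurements. The predicate names, the count/3 term and the sample figures are all hypothetical; the 10% threshold is the agreement level quoted in the abstract.

```prolog
:- use_module(library(lists)).

% relative_difference(+New, +Reference, -Percent)
% Absolute difference between New and Reference, as a percentage
% of the reference value.
relative_difference(New, Reference, Percent) :-
    Reference =\= 0,
    Percent is abs(New - Reference) * 100 / abs(Reference).

% within_tolerance(+New, +Reference, +MaxPercent)
within_tolerance(New, Reference, MaxPercent) :-
    relative_difference(New, Reference, Percent),
    Percent =< MaxPercent.

% mismatches(+Counts, +MaxPercent, -Bad)
% Counts is a list of count(Name, New, Reference) terms; Bad lists
% the names of the measurements that fall outside the tolerance.
mismatches(Counts, MaxPercent, Bad) :-
    findall(Name,
            ( member(count(Name, New, Ref), Counts),
              \+ within_tolerance(New, Ref, MaxPercent) ),
            Bad).
```

With the made-up figures count(cycles, 1050, 1000) and count(mem_accesses, 1300, 1000) and a 10% tolerance, mismatches/3 reports only mem_accesses, pointing the investigation at the factor that needs to be understood.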


In this paper, the term host designates the machine on which the simulator is run and
target refers to the computer architecture/system being simulated. Validation refers to the
process of ensuring that the simulator is coded correctly and that it accurately models the
target.

In the initial phase, where a paper design is the only basis available, validation of the
simulator usually consists of:


1. Manually checking for correct coding according to the paper design.

2. Running the simulator and checking for functional correctness, comparing
the results with manually worked out solutions.

3. Manually checking the timing of sub-blocks in the simulator.

4. Running the simulator to obtain timing estimates.

5. Running the simulator with instrumentation turned on to capture dynamic
execution statistics.


The term manually used above refers to the ad hoc approach of eyeballing (for steps 1
and 3), hand calculations (step 2), or writing small, very special purpose software tools to
accomplish the tasks. This approach is very tedious and error prone, but is often the only
possible way at this phase since a paper design is the only available basis. In the last step,
the monitor facility for instrumentation should not affect the timing.

Once the initial simulator is validated, it may be used as a basis for validating other
simulation systems. The validation process can now be done with a greater degree of
automation, and thus with greater efficiency. However, great care must still be taken to
understand the factors that cause discrepancies.

The validation process of a multiprocessor system¹ simulator involves the following steps:

1. sequential execution on one processor. This is done to test the processor
module of the simulator and the relevant support modules such as assembler
and loader.

2. parallel execution on one processor. This is a degenerate case, done to
measure the overhead of parallel execution.

3. parallel execution on two processors. This is a special case for testing
interprocessor communication with no interference since there is exactly one
sender and one receiver.

4. parallel execution on three or more processors. This is the general case of
parallel execution, with potential for interference on shared resources such
as the memory and communication channels. It is also used to test the full
extent of the parallel execution model. As more processors are added to the
configuration, the saturation of shared resources will occur and bottlenecks
will appear.

In this paper, we present the application of the first step of validation of a multiprocessor
simulator, using a previously validated uniprocessor simulator as a basis. Since there are
architecture and execution model variations in the two simulators, their results are compared
for proximity, not for exact equality. The following sections provide details on the simulators
and the validation approach.


3  Simulator  Descriptions 

The validation process is demonstrated using two simulators: VPsim and NuSim. Both
simulators provide an abstract machine engine for fast execution of the Prolog language.
VPsim is a previously validated simulator to be used as the basis of comparison for NuSim.

¹ The term multiprocessor system is used to include both the multiprocessor architecture and
the parallel execution model.

3.1  VPsim


VPsim is a register transfer level simulator for the VLSI-PLM [STN+88]. This chip is a VLSI
implementation of a high performance engine for Prolog, a modified version of the abstract
machine proposed by Warren [War83]. VPsim is written in the C language, consisting of more
than 4500 lines of C code and 9000 lines of microcode operations (register transfers, CPU
operations and microbranches).

To verify VPsim's functional correctness, a wide variety of Prolog programs were run
on VPsim and the results compared with those obtained from runs on software Prolog
environments such as Quintus Prolog. Because VPsim is microcode driven, the microstates
automatically provide accurate timing, with each microstate being executed in exactly one
processor cycle. Gate and transistor level simulations of the VLSI-PLM chip are compared
against the results from VPsim. The fabricated chip has passed an extensive testing process
and has successfully executed a number of benchmark programs. Work is in progress to
interface the chip with a cache and memory board to be used as a coprocessor for the SUN
workstation.

From the perspective of this paper, VPsim is a solid simulator that has been well tested
and has been verified by the hardware. It is an available resource that can be used as a basis
for testing other simulation systems.

3.2  NuSim 


To carry out our study in parallel execution of Prolog, we need an accurate and flexible tool to
be used as a testbed for new ideas. We approach our study from a system designer's point of
view, working with the complete system from software execution model to hardware support
for high performance. We are particularly interested in practical designs that can be built in
reasonable time. For these reasons, we base our multiprocessor study on our knowledge and
experience with sequential execution of Prolog on the VLSI-PLM. In addition to the Prolog
specific instructions, the chip contains a number of simple general purpose instructions and
primitive support for synchronization. This makes it a good candidate building block for a
multiprocessor system.

A simulator can best serve our interest in hardware support for high performance. The
result obtained from a simulation run reflects a composite effect of many intricate details
that cannot be easily formulated or calculated. By varying the parameters of the simulator,
the effect that each parameter has on overall performance can be measured.

We have constructed a new simulation system, called NuSim, to facilitate our studies of
parallel execution models and the underlying multiprocessor architectures. This simulator
framework allows for complete system simulation, from the instruction set level to the
memory architecture level with caches and coherency protocols. The key feature of this
simulator framework is flexibility, which allows for extensive instrumentation and continual
updates and changes. The modular design identifies main features of the execution model
and the architectures being simulated as cleanly separated modules with clearly defined
interfaces. This allows for easy modifications to the individual modules to support new
execution models and architectures.

Figure 1: Overview of NuSim Simulator

NuSim is an event-driven simulator, with the events being memory accesses ordered
by time. This technique simulates a multiprocessor using a uniprocessor. NuSim consists
of 16000 lines of C code and two small machine dependent routines to save and restore the
coroutine stacks. It is fairly portable, currently running under 4.3 BSD Unix on the 785
and Sun 3, and under System V Unix on an Intel 386 personal computer.
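
The event-driven technique can be illustrated with a minimal sketch in Python (NuSim itself is written in C). The trace format, in which each simulated processor alternates between computing and issuing one memory access, and the latency table are hypothetical, and the sketch ignores caches and bus contention. It shows only how a single host thread can service the memory accesses of several simulated processors in time order.

```python
import heapq


def run_event_driven(access_traces, latency):
    """Service memory accesses from several simulated processors in time order.

    access_traces: one list per processor; each entry is (compute_cycles, kind),
    meaning the processor computes for compute_cycles and then issues one access.
    latency: cycles charged per access kind (hypothetical values).
    """
    events = []                                   # heap of (time, processor, index)
    for proc, trace in enumerate(access_traces):
        if trace:
            heapq.heappush(events, (trace[0][0], proc, 0))

    order = []                                    # accesses in the order serviced
    finish = [0] * len(access_traces)             # completion time per processor
    while events:
        time, proc, idx = heapq.heappop(events)   # earliest pending access first
        kind = access_traces[proc][idx][1]
        done = time + latency[kind]               # clock advances by table lookup
        order.append((time, proc, kind))
        finish[proc] = done
        if idx + 1 < len(access_traces[proc]):
            next_compute = access_traces[proc][idx + 1][0]
            heapq.heappush(events, (done + next_compute, proc, idx + 1))
    return order, finish


traces = [
    [(2, "read"), (1, "write")],                  # processor 0
    [(1, "read"), (5, "read")],                   # processor 1
]
order, finish = run_event_driven(traces, {"read": 3, "write": 2})
print(order)
print(finish)
```

A priority queue keyed on simulated time is one common way to keep events ordered; the source does not say which data structure NuSim uses.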

Figure 1 shows the structure of the NuSim simulator. Two of the major modules of
the simulator are the processor module and the memory system module. The processor
module emulates the VLSI-PLM instruction set, and is thus comparable to VPsim. The
memory system simulates a multi [Bel85] memory architecture, with each processor having
a local cache and all caches communicating with main memory via a single bus. The caches
are kept consistent via a hardware consistency protocol. In the context of this paper, these
two modules form the core of the simulator to which the validation process is applied. The
question at hand is: how well does NuSim simulate a VLSI-PLM?

3.3  Simulator  Differences


Although both NuSim and VPsim essentially simulate the VLSI-PLM chip, they were
created for very different purposes. VPsim was designed as a simulator for a very specific
microarchitecture of a Prolog processor. Details of the VLSI-PLM microarchitecture are
"hard-wired" into the microcode, in terms of what micro-operations are possible and the
constraints in packing the micro-operations into a micro-state. On the other hand, NuSim
was conceived as a more general purpose multiprocessor simulator for system integration,
dealing at all levels from hardware architecture to software execution model. It will be used
to experiment with different architectures and execution model tradeoffs.

Because of the different goals in creating the simulators, there are a number of differences
between them. These differences are identified to help us understand the differences in
performance numbers. The following are some differences between VPsim and NuSim (running
sequential code):


•  simulation level. VPsim is a register-transfer-level, cycle-by-cycle simulator,
while NuSim is an event driven simulator which steps by memory access.
The clock of VPsim is incremented each cycle, while the clock of NuSim is
incremented by a value obtained from table lookup.


•  cdr-coding. VPsim uses cdr-coding, while NuSim does not. Cdr-coding is
a compressed representation for list elements stored in consecutive memory
locations. It requires a bit to indicate if the next location is the cdr of the
next element. Cdr-coding is eliminated because its complexity has caused
many subtle bugs in the microcode while contributing little to the overall
performance [Dob87].


•  instruction fetch. NuSim does instruction fetch on demand, and accounts
time for all fetches. VPsim does prefetching, which does not charge time for
all fetches, but may spend time fetching unnecessarily.


•  memory system. NuSim has a cache/memory system with realistic values
for memory access time. It accounts time for cache misses and block transfers
from memory. VPsim has single (processor) cycle memory.


•  Prolog builtins. VPsim treats some Prolog builtins (language predefined
routines) as external functions, and ships data outside the VLSI-PLM
processor for processing by the host. A varying amount of time is charged for
the data shipment (3 to 10 cycles), but no time is charged for executing
the external function. VPsim also implements some Prolog builtins in the
library using VLSI-PLM assembly code. NuSim, on the other hand, executes
all Prolog builtins inside the processor, and charges time for them as normal
instructions. In NuSim, all builtins are written in C code.
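
The cdr-coding difference listed above can be made concrete with a simplified sketch. The cell layout and the one-bit continuation flag below are illustrative assumptions, not the actual VLSI-PLM tag format. The sketch only shows why a list takes roughly twice as many cells, and correspondingly more work to build, when cdr-coding is absent.

```python
def uncoded_cells(items):
    """Without cdr-coding: every element takes a car cell plus a cdr cell."""
    cells = []
    for position, item in enumerate(items):
        is_last = position == len(items) - 1
        cells.append(("car", item))
        # The cdr cell holds the index of the next car cell, or nil at the end.
        cells.append(("cdr", "nil" if is_last else 2 * (position + 1)))
    return cells


def cdr_coded_cells(items):
    """With cdr-coding: one cell per element plus a one-bit continuation flag."""
    return [
        (item, position < len(items) - 1)   # True: the list continues in the next cell
        for position, item in enumerate(items)
    ]


elements = list(range(30))                  # a 30-element list, as built by nrev1
plain = uncoded_cells(elements)
coded = cdr_coded_cells(elements)
print(len(plain), len(coded))
```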


4  The  Validation  Example 


In this section, we will compare the performance results of NuSim to those of VPsim to see
how closely NuSim simulates a VLSI-PLM processor. Many benchmarks were run on both
NuSim and VPsim, and their execution outputs were compared for functional correctness.
A group of benchmarks was chosen for closer timing evaluation. These benchmarks differ
widely in static code size, dynamic memory usage, and execution time.

We have identified a number of measurements for comparison. They are: static code size,
cycle count, simulation overhead, operation count, and memory access count. Each type of
measurement provides a different perspective on the simulation results, helping to understand
the similarities and differences between the two simulators and at the same time validating
the results of NuSim.


Table 1: Benchmark Code Sizes and Descriptions

Benchmark          NS code   VP code   NS/VP   Description

bintree                181       198    0.91   build a 6-node binary tree
compiler_bintree     11409     12485    0.92   compiling the bintree program
compiler_plm1        11613     12750    0.91   compiling portion of the compiler
hanoi                   91        82    1.11   towers of hanoi for 8 disks
mumath                 262       251    1.04   Hofstadter's mumath problem for muiiu
newchat               8018      8446    0.95   parsing sentences with the chat parser
nrev1                  164       109    1.50   naive reverse a 30-element list
palin25                290       259    1.12   palindrome for a 25-character string
puzzle                1158      1049    1.10   solve a puzzle
qs4                    249       163    1.53   quicksort on 50 numbers
qs4_meta               487       397    1.23   Prolog meta interpreter running qs4
queens8                295       304    0.97   8-queens problem
reducer               2017      2020    1.00   a graph reducer for T-combinators
sdda                  1663      1636    1.02   static data dependency analysis
tak                     69        77    0.90   solves a recursively defined function

con1                    52        46    1.13   concatenation of 3- and 2-element lists
con6                    55        45    1.15   pairwise partition of a 5-element list
fibo                    71        69    1.03   compute 5th fibonacci number

4.1  Static  Code  Size 

Table 1 shows the descriptions and the static code sizes (in number of lines) for the same
benchmarks compiled using different options for execution under NuSim (NS) and VPsim
(VP). The three smallest benchmarks (con1, con6, and fibo) are listed separately at the
bottom. The ratios NS/VP show that static NS code and VP code are for the most part
well within 10% of one another. The ones that show big variances are due to the lack of
cdr-coding in NuSim, which requires two instructions to build an element (car and cdr) of a
list. For example, nrev1 builds a list of 30 elements before reversing it and qs4 builds a list
of 50 elements before quick-sorting it.
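
The proximity comparison can be sketched in a few lines of Python. The code sizes below are a subset of the Table 1 values; the function names and the 10% tolerance parameter are illustrative choices rather than part of either simulator.

```python
# (NS code, VP code) static sizes in lines, for a subset of the Table 1 benchmarks
code_sizes = {
    "bintree": (181, 198),
    "newchat": (8018, 8446),
    "nrev1": (164, 109),
    "qs4": (249, 163),
    "reducer": (2017, 2020),
    "sdda": (1663, 1636),
}


def size_ratios(sizes):
    """NS/VP ratio of static code size for each benchmark."""
    return {name: ns / vp for name, (ns, vp) in sizes.items()}


def outliers(ratios, tolerance=0.10):
    """Benchmarks whose ratio differs from 1 by more than the tolerance."""
    return sorted(name for name, r in ratios.items() if abs(r - 1.0) > tolerance)


ratios = size_ratios(code_sizes)
flagged = outliers(ratios)
print(flagged)
```

For this subset, the two benchmarks flagged are the list-building ones named in the text.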

4.2  Cycle  Count  (Simulated  Time) 

Columns VP cycles and NS/VP cycles of Table 2 show the cycle count of VPsim and the
ratio of NuSim/VPsim cycles, respectively. The hit ratio column shows results for NuSim
configured to a 4-way associative, 64K byte cache with a block size of 16 bytes.

From  these  columns,  we  observe  that: 

•  Simulated time of NuSim is quite comparable to VPsim (column NS/VP
cycles value is approximately 1) for the large benchmarks (compiler_bintree,
compiler_plm1, newchat, queens8, reducer, and tak).

•  NuSim cycle count is worse than VPsim in the small benchmarks due to low
hit ratio (cache cold start). For example, con1, con6, and fibo have the lowest
hit ratios among the benchmarks, measuring at 88.7%, 95.7%, and 95.6%,
respectively.

•  Non-cdr coded lists also contribute a little to the degradation in performance
for a small benchmark such as nrev1, which has a decent hit ratio of 98.3%.
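
The cold-start effect can be illustrated with a small sketch. It assumes a simple model in which a fixed number of compulsory misses is paid once at startup, plus an optional steady-state miss rate. The numbers used are hypothetical and are not taken from the NuSim runs.

```python
def hit_ratio(total_refs, cold_misses, steady_miss_rate=0.0):
    """Hit ratio when cold_misses compulsory misses are paid once at startup."""
    if cold_misses > total_refs:
        raise ValueError("cold misses cannot exceed total references")
    steady_misses = (total_refs - cold_misses) * steady_miss_rate
    return 1.0 - (cold_misses + steady_misses) / total_refs


COLD_MISSES = 20    # hypothetical number of blocks touched for the first time
short_run = hit_ratio(total_refs=200, cold_misses=COLD_MISSES)
long_run = hit_ratio(total_refs=2_000_000, cold_misses=COLD_MISSES)
print(short_run, long_run)
```

Under this model the same startup cost that barely registers in a long run takes a visible share of a very short one, which is the pattern the small benchmarks show.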


4.3  Simulation  Overhead 

Although the time that the simulators require to run is largely independent of the correctness
of the results, it is interesting to compare the simulation overhead of the two simulators because
they simulate at two different levels and follow different simulation methodologies.

The  following  explanations  refer  to  Table  2: 

•  Column VP systime provides the system simulation time (the time taken to
run the simulator on the host in seconds), and column NS/VP systime
provides the NuSim to VPsim ratio. These numbers are obtained from running
simulations on a SUN 3/60 with 16MB of memory. These values give a feel
for the response time of the simulators, ranging from 0.5 sec to 5920 secs (or
1.64 hours).

•  The overhead columns are provided as the ratio of system simulation time
to simulated time, where simulated time is the cycle count (discussed in
section 4.2) multiplied by the cycle time, assuming 100ns cycle time for the
NuSim processor and the VLSI-PLM chip. For example, a value such as 2000
in these columns means that it took 2000 seconds of the SUN 3/60 time to
simulate 1 second of the VLSI-PLM.


Table 2: Cycle Count and Simulation Time

                          VP   NS/VP         NS        VP     NS/VP       VP    NS/VP
Benchmark             cycles  cycles  hit ratio   systime   systime   overhd   overhd

bintree                 9875    1.30       97.8       3.5      1.43     3544     1.10
compiler_bintree     2208006    0.99       99.5     529.5      0.87     2398     0.87
compiler_plm1        5997896    0.89       99.6    1426.4      0.75     2378     0.85
hanoi                  78884    1.50       99.9      21.4      1.17     2713     0.78
mumath                 96907    1.26       99.8      26.2      0.92     2704     0.73
newchat              6911008    1.09       99.9    1315.9      1.01     1904     0.92
nrev1                  21192    1.38       98.3       6.1      1.31     2878     0.95
palin25                25026    1.08       98.6       7.4      1.08     2957     1.00
puzzle              39456475    0.67       99.9    5920.2      0.43     1500     0.65
qs4                    43190    0.98       98.9      11.9      0.92     2755     0.94
qs4_meta              348051    1.17       98.9     113.6      0.65     3264     0.56
queens8             19759942    1.04      100.0    3354.2      1.16     1697     1.13
reducer              2543554    1.07       99.5     439.8      1.11     1729     1.04
sdda                   85382    1.14       98.5      28.0      0.93     3279     0.82
tak                  9398259    0.96       99.2    2461.5      0.62     2619     0.65

con1                     256    2.96       88.7       0.5      6.00    19531     2.03
con6                    1307    1.52       95.7       0.7      4.29     5356     2.82
fibo                    2225    1.44       95.6       1.2      2.50     5393     1.73

•  The worst numbers in the overhead columns appear in the three smallest
benchmarks con1, con6, and fibo. This is due to the initial overhead of
starting up the simulators. Also in the three smallest benchmarks, the overhead
of NuSim is much higher than VPsim (1.73 to 2.82 times worse). This is
because NuSim takes more time to start up, being a multiprocessor simulator
and having to assemble the benchmark into assembly code. For the larger
benchmarks, NuSim is more efficient than VPsim. Excluding the three
smallest benchmarks, the average overheads of NuSim and VPsim are 2203
and 2555, respectively. Thus NuSim is 16% more efficient.

•  Even though NuSim simulates the VLSI-PLM at a slightly higher level than
the register-transfer level of VPsim, it is not that much more efficient
because VPsim microcode is "flat" while NuSim C-routines are hierarchically
structured. The cost of structured code depends on the efficiency of the code
generated by the C compiler for subroutine calls and returns.
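
As a worked check of how the overhead columns relate to the other columns, the following sketch recomputes the VP overhead for a few rows of Table 2 from the cycle count and the system simulation time, using the 100ns cycle time stated above. The function and variable names are illustrative.

```python
CYCLE_TIME_S = 100e-9    # 100 ns cycle time assumed in the text


def overhead(cycles, systime_s, cycle_time_s=CYCLE_TIME_S):
    """Host seconds spent per second of simulated target time."""
    simulated_time_s = cycles * cycle_time_s
    return systime_s / simulated_time_s


# (VP cycles, VP systime in seconds) for a few rows of Table 2
vp_runs = {
    "bintree": (9875, 3.5),
    "hanoi": (78884, 21.4),
    "puzzle": (39456475, 5920.2),
    "con1": (256, 0.5),
}
vp_overhead = {name: round(overhead(c, t)) for name, (c, t) in vp_runs.items()}
print(vp_overhead)
```

The recomputed values agree with the VP overhd column for these rows, including the very large value for con1, where half a second of host time is spent on only 256 simulated cycles.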


Simulation of the VLSI-PLM on a SUN 3/60 is more than 2000 times slower than actual
execution on a VLSI-PLM for the following reasons:


•  Data and control transfers are processed sequentially. In a real machine, they
would be done in parallel. The VLSI-PLM has a two stage pipeline, with
the data unit and microsequencer executing in parallel. The VLSI-PLM data
unit is also capable of doing 8 simultaneous transfers in one cycle.

•  The host processor is less powerful than the target processor for symbolic
computation and the host memory access time is slower than the target
memory access time. The SUN 3/60 that we use has a 20MHz MC68020
and 16MB of main memory (300ns access time). There is no cache. The
VLSI-PLM is a complex processor with tag processing capability.

•  The code generated by the C compiler affects the execution time of the host.
For example, inefficient subroutine calls and returns penalize the hierarchical
structure of NuSim C code.

•  The presence of extensive instrumentation code in the simulators for
extracting performance results slows down execution on the host.

•  The operating system characteristics of the host can greatly affect
performance. The SUN 3/60 runs 4.3 BSD Unix with virtual memory. The CPU
accesses a shared file server connected via Ethernet, and thus page faults are
very expensive.

The factors above blend together in the real uniprocessor system and it is difficult to
measure them separately. This is the reason why a simulator is needed for experimentation
with individual system parameters. For simulating a multiprocessor configuration, the event
driven approach of NuSim may be accelerated by use of a faster uniprocessor, or a
multiprocessor host, as demonstrated by [Wil87, Jon86]. For the greatest efficiency in simulation,
a direct execution approach such as the one proposed by Fujimoto [FC88] may be used,
where the benchmark is compiled into code directly executable by the host and instrumentation
counters are inserted by the compiler into the code to measure performance for the target
machine.

4.4  Operation  Count 

In Prolog, the metric Logical Inferences Per Second in units of 1000 (KLIPS) is often used for
measuring the performance of Prolog engines. A logical inference can be defined as a Prolog
function call, which includes the VLSI-PLM instructions calls, executes, and escapes for Prolog
builtins. This metric is quite inaccurate since the logical inference cannot be measured
exactly. The amount of work done by a Prolog function call depends on the number and
type of arguments in Prolog. For parallel execution, the KLIPS measurement has even less
significance. Multiprocessors may do more work but do not necessarily achieve the final
result any faster, if the additional computations do not contribute directly to the result.


Table 3: Logical Inference Count

                         NS        NS      NS        VP        VP      VP   NS/VP
Benchmark             calls   escapes   KLIPS     calls   escapes   KLIPS   KLIPS

bintree                  77       151     177       128       101     232    0.76
compiler_bintree      15113      7186     102     20886      2539     106    0.96
compiler_plm1         42597     22318     122     67060      3992     118    1.03
hanoi                   767       765     129      1022       511     194    0.67
mumath                 1211        82     106      1221        73     134    0.79
newchat               66905        60      89     66911        55      97    0.92
nrev1                   497         2     171       497         3     236    0.72
palin25                 228        97     121       323         3     130    0.93
puzzle                19796      6018      10     21800      4015       7    1.50
qs4                     381       231     144       610         3     142    1.02
qs4_meta               2694       720      84      3795         3     109    0.77
queens8               76457    151736     111    228009       185     115    0.96
reducer               15091      6305      79     18815      2491      84    0.94
sdda                    552       408      99       715       249     113    0.87
tak                   63609    111317     195    174924         3     186    1.05

con1                      4         2      79         4         3     273    0.29
con6                      6        30     181         6        31     283    0.64
fibo                     15        23     118        36         3     175    0.68

Table 3 shows the number of normal calls/executes and Prolog builtin invocations (or
escapes). Since VPsim does calls to library routines for some of the builtins, it has a much
higher call count and a lower escape count than NuSim. In order for KLIPS to be a useful
measure, the condition $NS_{call} + NS_{escape} \approx VP_{call} + VP_{escape}$ should hold true.
The results in Table 3 show that this condition does not hold, due to the implementation
variations of NuSim and VPsim (described in section 3.3).


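Each of the KLIPS columns is calculated by

\[ \text{KLIPS} = \frac{\text{calls} + \text{escapes}}{\text{cycles}} \times 10000 \]

where cycles is obtained from Table 2. The unit for calls and escapes is the logical inference.
The constant factor of 10000 comes from the KLIPS unit conversion, with a cycle time of
100 nsec:

\[ \frac{10^{9}\ \text{nsec}}{1\ \text{sec}} \times \frac{1\ \text{cycle}}{100\ \text{nsec}} \times \frac{1\ \text{K}}{1000} = 10000\ \frac{\text{K cycles}}{\text{sec}} \]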
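
The KLIPS calculation can be checked with a short sketch that applies the formula to two VP rows, using calls and escapes from Table 3 and cycles from Table 2. The function name and its cycle_time_ns parameter are illustrative.

```python
def klips(calls, escapes, cycles, cycle_time_ns=100):
    """Thousands of logical inferences per second of simulated time."""
    inferences_per_cycle = (calls + escapes) / cycles
    cycles_per_second = 1e9 / cycle_time_ns
    return inferences_per_cycle * cycles_per_second / 1000


vp_hanoi = klips(calls=1022, escapes=511, cycles=78884)
vp_nrev1 = klips(calls=497, escapes=3, cycles=21192)
print(round(vp_hanoi), round(vp_nrev1))
```

With the default 100 nsec cycle, the two conversion steps multiply out to the constant factor of 10000 used in the formula.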

The NS KLIPS and VP KLIPS columns differ widely, showing once again the problem with
this metric. For comparison purposes, the timing information in Table 2 is much more useful
than this metric.

4.5  Memory  Accesses


Table 4 compares the number of memory accesses made in running the simulations on NuSim
and VPsim. VP total-refs gives the total count of memory references to give a sense of the
order of magnitude of memory accesses, which range from about 100 to over 12 million. The
four NS/VP columns show the ratios of accesses between NuSim and VPsim for total references,
instruction fetches, reads, and writes.


Table 4: Memory References

                             VP   NS/VP         VP    NS/VP         VP   NS/VP         VP    NS/VP
Benchmark            total-refs    refs     ifetch   ifetch      reads   reads     writes   writes

bintree                    5601    1.19       2527     1.03       1568    1.60       1506     0.62
compiler_bintree        1259778    1.07     470464     1.16     426110    1.06     363204     0.93
compiler_plm1           3511904    0.96    1380796     0.95    1161595    1.04     969513     0.88
hanoi                     51811    1.38      21441     1.65      13776    1.26      16594     1.12
mumath                    53052    1.29      16258     1.78      18639    1.03      16155     1.03
newchat                 3695155    1.16    1376937     1.50    1156506    0.95    1159712     0.97
nrev1                      8473    1.51       4612     1.97       2017    0.81       1644     1.06
palin25                   12759    1.10       5695     1.31       4114    0.89       2950     0.99
puzzle                 11600446    0.81     772251     1.59    9498654    0.72    1330541     0.98
qs4                       24302    0.93      11241     1.04       5509    0.67       7652     0.79
qs4_meta                 197469    1.13      70542     1.42      61671    0.97      65256     0.96
queens8                12354397    1.09    5220239     1.25    4248614    0.99    2885544     0.96
reducer                 1367058    1.14     462255     1.46     607144    0.99     397659     0.97
sdda                      48313    1.13      17631     1.33      16752    1.06      13930     0.95
tak                     5979238    0.83    3291760     0.66    1033643    1.16    1653835     0.96

con1                         54    2.11         55     2.07         22       -          -        -
con6                        499    1.58        163     1.64        166    1.04          -        -
fibo                       1207    1.10        648     1.13          -       -        344     0.95

We observe the following:


•  NuSim fetches instructions on demand, while VPsim does prefetching. NuSim
instructions are encoded in word streams, with the opcode and each operand
taking up one 32-bit word. VPsim has the code stored in string tables, but
the microcode generates prefetch signals to simulate an encoding of 8-bit
opcode and 32-bit arguments.

•  The total reference ratios for most benchmarks are about 1. The big
variations are for con1 (2.11), con6 (1.58), and nrev1 (1.51). The variations are
perfect examples of worst case performance without cdr-coding (in NuSim),
which requires more instruction fetches, reads and writes. For the larger
benchmarks, cdr-coding makes little difference.

•  The ifetch ratios show that the word-encoding of NuSim requires more fetches,
as expected. However, for tak, NuSim fetches much less (ifetch ratio of 0.66)
because many subtractions are done and NuSim uses the builtin instruction
is/2, while VPsim does a call to the library routine sub/3, which requires a
longer sequence of simpler instructions.
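
One way to sanity-check the memory reference figures is sketched below. It assumes that total references are the sum of instruction fetches, reads, and writes, and that the total NS/VP ratio is therefore the count-weighted average of the three component ratios. Both assumptions are consistent with the two Table 4 rows used here, but neither is stated explicitly in the text. The function names are illustrative.

```python
# Per benchmark: VP counts (ifetch, reads, writes, total-refs) and
# NS/VP ratios (ifetch, reads, writes), taken from Table 4
table4_rows = {
    "palin25": ((5695, 4114, 2950, 12759), (1.31, 0.89, 0.99)),
    "qs4_meta": ((70542, 61671, 65256, 197469), (1.42, 0.97, 0.96)),
}


def components_match_total(counts):
    """True if ifetch + reads + writes equals the reported total."""
    ifetch, reads, writes, total = counts
    return ifetch + reads + writes == total


def weighted_total_ratio(counts, ratios):
    """Total NS/VP ratio implied by the three component ratios."""
    ifetch, reads, writes, _total = counts
    ns_refs = ifetch * ratios[0] + reads * ratios[1] + writes * ratios[2]
    return ns_refs / (ifetch + reads + writes)


consistent = {name: components_match_total(c) for name, (c, _r) in table4_rows.items()}
implied = {name: round(weighted_total_ratio(c, r), 2) for name, (c, r) in table4_rows.items()}
print(consistent, implied)
```

For both rows the components add up to the reported total, and the implied total ratios match the NS/VP refs column (1.10 and 1.13).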


5  Discussion 


Simulation is an important part of system integration. In this paper, we have shown a
methodology for validating the simulator of a multiprocessor system. We applied this scheme
to validate the processor and the memory module of a multiprocessor simulator (NuSim)
by comparing it with a previously validated uniprocessor simulator (VPsim). Benchmarks
of various sizes were executed sequentially on both simulators, and different performance
measurements were evaluated and compared against one another.

Because the simulation result is a composite result of many factors, we chose a number
of measurements for comparison to obtain different perspectives on performance and to
understand the reasons for the variations. The chosen measurements were: code size, cycle
count, simulation overhead, operation count, and memory access counts. The different
measurements indicate that the variations are significant only for the small benchmarks, where
startup time and slight model differences are a big percentage of total execution time. For
large programs, NuSim is within 10% of the VLSI-PLM timing. Perhaps more importantly,
all variations can be accounted for. We can thus conclude that NuSim is representative of a
VLSI-PLM in a multiprocessor system. With NuSim, we can continue our study of
implementable multiprocessor systems for parallel execution of numeric and symbolic programs,
using logic programming. We also believe that the chosen measurements can be used in
validating other simulation systems.
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